We released GitTables: 1.7M relational tables extracted from CSV files on GitHub. The creation of GitTables was motivated by the strong industry interest in Sherlock, a deep learning model for semantic data type detection.
Through many GitHub issues and discussions with users, we learned that Sherlock's out-of-the-box applicability in industry contexts was low. This was mainly due to its training data, which consisted of tables extracted from HTML pages and annotated with 78 semantic types. These tables and semantic types generalized poorly to tables beyond the Web. GitTables is meant to close this gap.