TL;DR
GitTables is a curated corpus of 1 million relational tables extracted from GitHub. It is designed for training and evaluating deep learning models on relational table tasks beyond web data, and it ships with semantic column annotations and demonstrated applications.
Contribution
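The paper introduces GitTables, a large-scale, diverse relational table corpus whose columns are annotated with semantic types, hierarchical relations, and descriptions from Schema.org and DBpedia, to support table understanding and downstream applications.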
Findings
GitTables differs significantly from existing table corpora in structure, content, and topical coverage.
Semantic annotation pipeline achieves human-level accuracy.
Applications demonstrate improved table-to-knowledge graph matching and schema completion.
Abstract
The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of 1M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 10M tables. Analyses of GitTables show that its structure, content, and topical coverage differ significantly from existing table corpora. We annotate table columns in GitTables with semantic types, hierarchical relations and descriptions from Schema.org and DBpedia. The…
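To make the column annotation step concrete, here is a minimal sketch that assumes a column name is annotated when its normalized form exactly matches a type label. This is an illustration only, not necessarily the authors' pipeline; the vocabulary `SEMANTIC_TYPES` and the helpers `normalize` and `annotate_columns` are hypothetical and stand in for the much larger Schema.org and DBpedia ontologies.

```python
import re

# Hypothetical mini-vocabulary (assumption); the real ontologies are far larger.
SEMANTIC_TYPES = ["name", "address", "birth date", "postal code", "price"]


def normalize(label: str) -> str:
    """Lowercase a label and split camelCase, snake_case, and kebab-case into words."""
    label = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", label)  # birthDate -> birth Date
    label = re.sub(r"[_\-\s]+", " ", label)                # postal_code -> postal code
    return label.strip().lower()


def annotate_columns(columns, types=SEMANTIC_TYPES):
    """Map each column name to a semantic type, or None when nothing matches."""
    lookup = {normalize(t): t for t in types}
    return {col: lookup.get(normalize(col)) for col in columns}


annotations = annotate_columns(["birthDate", "postal_code", "Price", "col_7"])
for column, semantic_type in annotations.items():
    print(f"{column!r} -> {semantic_type!r}")
```

Exact matching after normalization is deliberately conservative: an uninformative header such as `col_7` stays unannotated rather than receiving a guessed type.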
