TabLib: A Dataset of 627M Tables with Context

Gus Eggert, Kevin Huo, Mike Biven, and Justin Waugh

arXiv:2310.07875·cs.CL·October 13, 2023·2 cites



TL;DR

TabLib is a massive, diverse dataset of 627 million tables with extensive contextual information, aiming to catalyze advancements in AI systems for tabular data.

Contribution

It introduces TabLib, the largest and most diverse dataset of tabular data, extracted from multiple formats and sources, filling a critical gap in AI resources.

Findings

01

Provides a comprehensive, large-scale tabular dataset for AI research.

02

Demonstrates the potential of TabLib to improve AI performance on table data.

03

Establishes a foundation for future development of tabular data models.

Abstract

It is well-established that large, diverse datasets play a pivotal role in the performance of modern AI systems for text and image modalities. However, there are no datasets for tabular data of comparable size and diversity to those available for text and images. Thus we present "TabLib", a compilation of 627 million tables totaling 69 TiB, along with 867B tokens of context. TabLib was extracted from numerous file formats, including CSV, HTML, SQLite, PDF, Excel, and others, sourced from GitHub and Common Crawl. The size and diversity of TabLib offer considerable promise in the table modality, reminiscent of the original promise of foundational datasets for text and images, such as The Pile and LAION.
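
To give a rough sense of scale, the headline figures in the abstract can be turned into per-table averages. The sketch below is only back-of-the-envelope arithmetic: it assumes "69 TiB" uses the binary prefix (1 TiB = 2**40 bytes), "627 million" means 627 × 10^6, and "867B" means 867 × 10^9. The constant and function names are illustrative and do not come from the paper or its code.

```python
# Back-of-the-envelope averages from the figures quoted in the abstract.
# Assumptions: binary TiB, and "B" read as 10**9.
NUM_TABLES = 627_000_000          # 627 million tables
TOTAL_BYTES = 69 * 2**40          # 69 TiB, with 1 TiB = 2**40 bytes
CONTEXT_TOKENS = 867_000_000_000  # 867B tokens of context


def per_table_averages(num_tables, total_bytes, context_tokens):
    """Return (mean bytes per table, mean context tokens per table)."""
    return total_bytes / num_tables, context_tokens / num_tables


avg_bytes, avg_tokens = per_table_averages(NUM_TABLES, TOTAL_BYTES, CONTEXT_TOKENS)
print(f"~{avg_bytes / 1024:.0f} KiB per table, ~{avg_tokens:.0f} context tokens per table")
```

Under these assumptions the means come out to roughly 118 KiB and 1,383 context tokens per table. These are averages only; the abstract does not describe how table sizes or context lengths are distributed.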


Code & Models

Repositories

mlfoundations/tabliblib


Taxonomy

Topics: Mathematics, Computing, and Information Processing