AnnotatedTables: A Large Tabular Dataset with Language Model Annotations
Yaojie Hu, Ilias Fountalis, Jin Tian, Nikolaos Vasiloglou

TL;DR
This paper introduces AnnotatedTables, a large dataset of over 32,000 databases with LLM-generated annotations, demonstrating the potential of language models to automate tabular data annotation and support diverse research tasks.
Contribution
The paper presents a scalable methodology for annotating large tabular datasets with language models, including SQL and input-target column annotations, and releases the AnnotatedTables dataset.
Findings
LLMs can generate valid, executable SQL annotations at scale: the released dataset includes 405,616 valid SQL programs across 32,119 databases.
LLMs can translate SQL to Rel programs with few-shot prompting.
TabPFN classifier performs comparably to AutoML on annotated tables.
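The classifier comparison above depends on input-target column annotations, which mark the columns that serve as features and the column that serves as the prediction target. The sketch below shows one way such an annotation could be applied to a table to produce a classification dataset. The annotation layout (a dict with `input_columns` and `target_column` keys), the function name, and the sample rows are hypothetical and are not taken from the released dataset.

```python
def split_by_annotation(records, annotation):
    """Split table rows into features X and labels y.

    records: list of dicts, one per table row (column name -> value).
    annotation: hypothetical format, assumed here to be
        {"input_columns": [...], "target_column": "..."}.
    """
    inputs = annotation["input_columns"]
    target = annotation["target_column"]
    # Keep only the annotated input columns, in the annotated order.
    X = [[row[col] for col in inputs] for row in records]
    # The annotated target column becomes the label vector.
    y = [row[target] for row in records]
    return X, y


# Illustrative rows and annotation (made up for this example).
records = [
    {"age": 34, "income": 52000, "id": 1, "defaulted": "no"},
    {"age": 51, "income": 31000, "id": 2, "defaulted": "yes"},
]
annotation = {"input_columns": ["age", "income"], "target_column": "defaulted"}
X, y = split_by_annotation(records, annotation)
```

The resulting `X` and `y` can then be passed to any classifier that follows the usual fit/predict convention. Columns not named in the annotation, such as `id` here, are dropped.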
Abstract
Tabular data is ubiquitous in real-world applications and abundant on the web, yet its annotation has traditionally required human labor, posing a significant scalability bottleneck for tabular machine learning. Our methodology can successfully annotate a large amount of tabular data and can be flexibly steered to generate various types of annotations based on specific research objectives, as we demonstrate with SQL annotation and input-target column annotation as examples. As a result, we release AnnotatedTables, a collection of 32,119 databases with LLM-generated annotations. The dataset includes 405,616 valid SQL programs, making it the largest SQL dataset with associated tabular data that supports query execution. To further demonstrate the value of our methodology and dataset, we perform two follow-up research studies. 1) We investigate whether LLMs can translate SQL programs to…
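Because the dataset pairs SQL programs with tabular data that supports query execution, a generated query can be checked for validity by running it. The sketch below illustrates the idea with Python's built-in `sqlite3` module and an in-memory database. The choice of SQLite, the function name `validate_sql`, and the sample table and queries are assumptions for illustration; the abstract does not state which engine or validity criterion the authors used.

```python
import sqlite3


def validate_sql(columns, rows, table_name, queries):
    """Return the queries that execute without error against the table.

    Assumption: "valid" here means only "runs without raising in SQLite".
    """
    conn = sqlite3.connect(":memory:")
    # Create an untyped table; identifiers are quoted to tolerate odd names.
    col_defs = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'CREATE TABLE "{table_name}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows
    )

    valid = []
    for query in queries:
        try:
            conn.execute(query).fetchall()
            valid.append(query)
        except sqlite3.Error:
            # Syntax errors, unknown columns, and similar failures land here.
            continue
    conn.close()
    return valid


# Illustrative table and candidate queries (made up for this example).
candidates = [
    "SELECT name FROM people WHERE age > 26",
    "SELECT salary FROM people",  # unknown column
    "SELEC * FROM people",        # syntax error
]
valid = validate_sql(
    ["name", "age"], [("a", 30), ("b", 25)], "people", candidates
)
```

Only the first candidate survives the check. A stricter filter could also reject queries that return empty results or that modify the table, depending on what the annotations are meant to support.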
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: tabular data Prior-data Fitted Network
