Schema-Driven Information Extraction from Heterogeneous Tables
Fan Bai, Junmo Kang, Gabriel Stanovsky, Dayne Freitag, Mark Dredze, Alan Ritter

TL;DR
This paper investigates the use of large language models for cost-effective, schema-driven extraction of structured data from heterogeneous tables across multiple domains, demonstrating competitive performance without task-specific training.
Contribution
The paper introduces a new schema-driven information extraction task and provides a benchmark for evaluating LLMs on diverse tabular data, highlighting their effectiveness and potential for cost-efficient applications.
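To make the task concrete, here is a minimal sketch of what "schema-driven" could look like in practice. It is an illustration, not the authors' pipeline. The schema attributes, the prompt wording, and the helper names `build_prompt` and `validate_record` are all assumptions introduced for this example, and no language model is actually called.

```python
import json

# Hypothetical schema: attribute name -> short human-written description.
# These attributes are illustrative and are not taken from the paper.
SCHEMA = {
    "model": "Name of the evaluated model",
    "dataset": "Dataset the score was measured on",
    "metric": "Name of the evaluation metric",
    "value": "Numeric score reported in the cell",
}


def build_prompt(schema, header, row, col_index):
    """Assemble a plain-text prompt asking for one cell's record as JSON."""
    attributes = "\n".join(f"- {name}: {desc}" for name, desc in schema.items())
    return (
        "Extract a JSON record for the target cell.\n"
        f"Attributes:\n{attributes}\n"
        f"Header: {' | '.join(header)}\n"
        f"Row: {' | '.join(row)}\n"
        f"Target cell: {row[col_index]}\n"
    )


def validate_record(schema, raw_json):
    """Parse a model response and keep only attributes named in the schema.

    Attributes missing from the response are filled with None, and keys
    outside the schema are dropped.
    """
    record = json.loads(raw_json)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    return {name: record.get(name) for name in schema}
```

For example, `validate_record(SCHEMA, '{"model": "BERT", "value": 91.2, "extra": 1}')` keeps `model` and `value`, sets `dataset` and `metric` to `None`, and discards `extra`. Keeping the schema as plain data means the same extraction code can be reused across domains by swapping in a different schema.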
Findings
LLMs achieve F1 scores from 74.2 to 96.1 on the benchmark.
Cost-efficient extraction is possible without task-specific pipelines.
Distilling compact models reduces API reliance while maintaining performance.
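For readers unfamiliar with the metric, the sketch below shows one common way to compute F1 over extracted attribute-value pairs. It assumes exact matching between predicted and gold pairs, which may differ from the scoring protocol used in the paper; the function name `f1_score` and the sample pairs are hypothetical.

```python
def f1_score(predicted, gold):
    """Exact-match F1 between two collections of (attribute, value) pairs.

    Returns a value in [0, 1]; multiply by 100 to compare with scores
    reported on a 0-100 scale.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only: two of three predicted pairs match the gold record.
predicted_pairs = [("model", "BERT"), ("value", "91.2"), ("metric", "F1")]
gold_pairs = [("model", "BERT"), ("value", "91.2"), ("dataset", "SQuAD")]
score = f1_score(predicted_pairs, gold_pairs)
```

Here precision and recall are both 2/3, so `score` is also 2/3.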
Abstract
In this paper, we explore the question of whether large language models can support cost-efficient information extraction from tables. We introduce schema-driven information extraction, a new task that transforms tabular data into structured records following a human-authored schema. To assess various LLMs' capabilities on this task, we present a benchmark composed of tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages. We use this collection of annotated tables to evaluate the ability of open-source and API-based language models to extract information from tables covering diverse domains and data formats. Our experiments demonstrate that surprisingly competitive performance can be achieved without requiring task-specific pipelines or labels, achieving F1 scores ranging from 74.2 to 96.1, while maintaining cost…
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Natural Language Processing Techniques
