Can LLMs Credibly Transform the Creation of Panel Data from Diverse Historical Tables?
Verónica Bäcker-Peral, Vitaly Meursault, Christopher Severen

TL;DR
This paper shows that multimodal LLMs can digitize complex historical tables at about one hundredth the cost of outsourcing, cut critical parsing errors from 40% to 0.3%, and produce data that closely match a human-validated gold standard for economic analysis.
Contribution
It introduces a validated LLM-based pipeline for digitizing historical data, significantly lowering costs and errors compared to traditional methods.
Findings
Critical parsing error rate reduced from 40% to 0.3%
Data match the human-validated gold standard with an R^2 of 98.6%
Enables new economic analyses with historical data
Abstract
Multimodal LLMs offer a watershed change for the digitization of historical tables, enabling low-cost processing centered on domain expertise rather than technical skills. We rigorously validate an LLM-based pipeline on a new panel of historical county-level vehicle registrations. This pipeline is 100 times less expensive than outsourcing, reduces critical parsing errors from 40% to 0.3%, and matches human-validated gold standard data with an R^2 of 98.6%. Analyses of growth and persistence in vehicle adoption are statistically indistinguishable whether using LLM or gold standard data. LLM-based digitization unlocks complex historical tables, enabling new economic analyses and broader researcher participation.
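To make the validation metrics in the abstract concrete, here is a minimal sketch of how agreement between LLM-digitized values and a gold standard could be scored. It is an illustration under stated assumptions, not the authors' procedure: the function names `r_squared` and `mismatch_rate` are hypothetical, the R^2 is computed as the coefficient of determination with the LLM values treated as predictions of the gold values (the paper may define it differently, for example via a regression), and the sample numbers are made up rather than taken from the paper.

```python
from typing import Sequence


def r_squared(gold: Sequence[float], llm: Sequence[float]) -> float:
    """Coefficient of determination, treating LLM values as predictions of gold values.

    Assumption: this is one common definition of R^2; the paper's exact
    definition is not given in the abstract.
    """
    if len(gold) != len(llm) or len(gold) < 2:
        raise ValueError("need two equal-length series with at least two values")
    mean_gold = sum(gold) / len(gold)
    ss_res = sum((g - p) ** 2 for g, p in zip(gold, llm))  # squared disagreement
    ss_tot = sum((g - mean_gold) ** 2 for g in gold)       # variation in gold data
    if ss_tot == 0:
        raise ValueError("gold values have no variation; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


def mismatch_rate(gold: Sequence[float], llm: Sequence[float]) -> float:
    """Share of cells where the LLM value differs from the gold value."""
    if len(gold) != len(llm) or not gold:
        raise ValueError("need two equal-length, non-empty series")
    return sum(g != p for g, p in zip(gold, llm)) / len(gold)


# Hypothetical county registration counts (not from the paper).
# The third LLM value simulates a digit transposition (560 read as 650).
gold_counts = [120, 340, 560, 780]
llm_counts = [120, 340, 650, 780]

print(f"R^2: {r_squared(gold_counts, llm_counts):.4f}")
print(f"Mismatch rate: {mismatch_rate(gold_counts, llm_counts):.2%}")
```

A cell-level mismatch rate and an R^2 answer different questions: the first counts how often a value is wrong, while the second weights errors by their size relative to the variation in the data, so a single large misread can lower R^2 more than several small ones.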
Taxonomy
Topics: Human Mobility and Location-Based Analysis · Data-Driven Disease Surveillance · Computational and Text Analysis Methods
