TWIX: Automatically Reconstructing Structured Data from Templatized Documents
Yiming Lin, Mawil Hasan, Rohan Kosalge, Alvin Cheung, Aditya G. Parameswaran

TL;DR
TWIX is a novel tool that predicts templates behind templatized documents, enabling more accurate, efficient, and cost-effective data extraction compared to existing methods and vision-based LLMs.
Contribution
The paper introduces TWIX, a method for automatically reconstructing templates from templatized documents to improve data extraction accuracy and efficiency.
Findings
TWIX achieves over 90% precision and recall.
Outperforms industry tools and GPT-4-Vision by over 25% in precision and recall.
Significantly faster and cheaper than vision-based LLMs.
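For readers less familiar with the metrics behind these findings, here is a minimal sketch of how precision and recall could be computed for extracted field-value pairs. The function name, the exact-match scoring rule, and the sample data are illustrative assumptions, not the paper's evaluation protocol.

```python
def precision_recall(extracted, truth):
    """Score extracted (field, value) pairs against ground truth.

    Assumption: a pair counts as correct only on an exact match.
    The paper may use a different matching rule.
    """
    extracted, truth = set(extracted), set(truth)
    correct = len(extracted & truth)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical ground truth and extractor output for one document.
truth = {("Name", "Ada"), ("Date", "2024-01-05"),
         ("Total", "40.00"), ("Status", "Paid")}
extracted = {("Name", "Ada"), ("Date", "2024-01-05"), ("Total", "400.0")}

p, r = precision_recall(extracted, truth)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case two of the three extracted pairs are correct and two of the four true pairs are recovered, so precision is higher than recall.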
Abstract
Many documents, which we call templatized documents, are programmatically generated by populating fields in a visual template. Effective data extraction from these documents is crucial to supporting downstream analytical tasks. Current data extraction tools often struggle with complex document layouts, incur high latency and/or cost on large datasets, and require significant human effort when extracting tables or values for user-specified fields. The key insight of our tool, TWIX, is to predict the underlying template used to create such documents by modeling the visual and structural commonalities across documents. Data extraction based on this predicted template provides a more principled, accurate, and efficient solution at a low cost. Comprehensive evaluations on 34 diverse real-world datasets show that uncovering the template is crucial for data extraction from…
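To make the template idea concrete, here is a toy sketch, not TWIX's actual algorithm: it treats phrases that recur across most documents as template fields and pairs each field with the phrase that follows it. The helper names `infer_fields` and `extract_record`, the `min_share` threshold, and the sample documents are all hypothetical, and the sketch ignores the visual layout cues the abstract mentions.

```python
from collections import Counter


def infer_fields(documents, min_share=0.8):
    """Return phrases that appear in at least min_share of the documents.

    Assumption: template fields repeat across documents, values do not.
    """
    counts = Counter()
    for phrases in documents:
        counts.update(set(phrases))  # count each phrase once per document
    threshold = min_share * len(documents)
    return {phrase for phrase, n in counts.items() if n >= threshold}


def extract_record(phrases, fields):
    """Pair each field phrase with the next phrase if that one is not a field."""
    record = {}
    for phrase, nxt in zip(phrases, phrases[1:]):
        if phrase in fields and nxt not in fields:
            record[phrase] = nxt
    return record


# Hypothetical documents, already split into phrases in reading order.
docs = [
    ["Name", "Ada", "Date", "2024-01-05", "Total", "40.00"],
    ["Name", "Bob", "Date", "2024-02-11", "Total", "12.50"],
    ["Name", "Cy", "Date", "2024-03-02", "Total", "7.25"],
]

fields = infer_fields(docs)
records = [extract_record(d, fields) for d in docs]
print(sorted(fields))
print(records[0])
```

Once the fields are inferred from the collection, each document is processed with cheap lookups instead of a per-page model call, which is one plausible reason a template-based approach can be faster and cheaper.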
