Tables to LaTeX: structure and content extraction from scientific tables
Pratik Kayal, Mrinal Anand, Harsh Desai, Mayank Singh

TL;DR
This paper introduces a transformer-based model that converts images of scientific tables into LaTeX code. It captures both table structure (such as spanning cells) and content (such as mathematical symbols), and it outperforms the state-of-the-art baselines it is compared against.
Contribution
The paper adapts the transformer-based language modeling paradigm to scientific table extraction, addressing visual features (spanning cells) and content features (mathematical symbols and equations) that most prior methods tend to ignore.
Findings
Achieves 70.35% exact match accuracy in table structure extraction.
Achieves 49.69% exact match accuracy in table content extraction.
Efficiently identifies rows, columns, and symbols in scientific tables.
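The exact match figures above count a prediction as correct only when the whole generated sequence equals the reference. A minimal sketch of such a metric follows, assuming whitespace-separated tokens; the function name `exact_match_accuracy` and the tokenization are illustrative assumptions, and the paper's own evaluation script may differ.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions whose token sequence equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    # Assumption: tokens are whitespace-separated; the paper's tokenizer may differ.
    matches = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return 100.0 * matches / len(predictions)


# Toy data, not from the paper: the second prediction differs in one cell.
preds = [r"\begin{tabular}{ c c } a & b \\ \end{tabular}", r"x & y \\"]
refs = [r"\begin{tabular}{ c c } a & b \\ \end{tabular}", r"x & z \\"]
print(exact_match_accuracy(preds, refs))
```

Because a single wrong token makes the whole table count as a miss, exact match is a strict metric, which may partly explain why the content score is lower than the structure score.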
Abstract
Scientific documents contain tables that list important information in a concise fashion. Structure and content extraction from tables embedded within PDF research documents is a very challenging task due to the existence of visual features like spanning cells and content features like mathematical symbols and equations. Most existing table structure identification methods tend to ignore these academic writing features. In this paper, we adapt the transformer-based language modeling paradigm for scientific table structure and content extraction. Specifically, the proposed model converts a tabular image to its corresponding LaTeX source code. Overall, we outperform the current state-of-the-art baselines and achieve exact match accuracies of 70.35% and 49.69% on table structure and content extraction, respectively. Further analysis demonstrates that the proposed models efficiently…
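To make the distinction between structure and content concrete, the sketch below shows a hypothetical LaTeX target for a small table with one spanning cell and one mathematical symbol, plus a helper that masks cell contents so only the layout remains. The table, the `CELL` placeholder, and the `structure_view` helper are all illustrative assumptions, not the paper's dataset, vocabulary, or preprocessing.

```python
import re

# Hypothetical target sequence (not taken from the paper's dataset).
TARGET = r"""\begin{tabular}{l c c}
\hline
Model & \multicolumn{2}{c}{Accuracy} \\
\hline
Ours & 0.91 & $\alpha = 0.5$ \\
\hline
\end{tabular}"""


def structure_view(latex: str, placeholder: str = "CELL") -> str:
    """Replace each cell's content with a placeholder, keeping layout commands.

    Simplification: cells are split on '&', so escaped ampersands are not handled.
    """
    out = []
    for line in latex.splitlines():
        stripped = line.strip()
        if not stripped.endswith(r"\\"):
            out.append(stripped)  # \begin, \hline, \end: structure only
            continue
        masked = []
        for cell in stripped[:-2].split("&"):
            # Keep the \multicolumn{n}{spec} wrapper, mask only its content.
            m = re.match(r"\s*(\\multicolumn\{\d+\}\{[^}]*\})\{.*\}\s*$", cell)
            masked.append(f"{m.group(1)}{{{placeholder}}}" if m else placeholder)
        out.append(" & ".join(masked) + r" \\")
    return "\n".join(out)


print(structure_view(TARGET))
```

In this view, the spanning cell survives as a `\multicolumn` command while the text and the math symbol are masked, which is roughly the kind of information a structure extraction task has to get right independently of content.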
