Synthetic Data Augmentation for Table Detection: Re-evaluating TableNet's Performance with Automatically Generated Document Images
Krishna Sahukara, Zineddine Bettouche, Andreas Fischer

TL;DR
This paper presents an automated LaTeX-based pipeline that generates synthetic document images containing tables. The images are used to train and evaluate table detection models such as TableNet while reducing manual annotation effort.
Contribution
It introduces a synthetic data generation method for table detection that automatically creates realistic document images with aligned ground-truth masks, which are used both to train TableNet and to benchmark it.
Findings
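TableNet trained on the synthetic data achieves a pixel-wise XOR error of 4.04% at 256x256 input resolution and 4.33% at 1024x1024 on the synthetic test set.
On the Marmot benchmark, the best reported pixel-wise XOR error is 9.18%, at 256x256 input resolution.
Automated generation of pages and ground-truth masks reduces manual annotation effort.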
Abstract
Document pages captured by smartphones or scanners often contain tables, yet manual extraction is slow and error-prone. We introduce an automated LaTeX-based pipeline that synthesizes realistic two-column pages with visually diverse table layouts and aligned ground-truth masks. The generated corpus augments the real-world Marmot benchmark and enables a systematic resolution study of TableNet. Training TableNet on our synthetic data achieves a pixel-wise XOR error of 4.04% on our synthetic test set with a 256x256 input resolution, and 4.33% with 1024x1024. The best performance on the Marmot benchmark is a pixel-wise XOR error of 9.18% (at 256x256). Automating page generation and annotation also cuts manual annotation effort.
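The abstract reports results as a pixel-wise XOR error but does not define it here. The sketch below shows one common reading, assumed for illustration: the percentage of pixels where a binary predicted table mask disagrees with the ground-truth mask. The function name xor_error_percent and the toy 4x4 masks are hypothetical and are not taken from the paper.

```python
import numpy as np


def xor_error_percent(pred_mask, true_mask):
    """Percentage of pixels where two binary masks disagree.

    Assumed definition (not confirmed by the source): the mean of the
    element-wise XOR of the two masks, scaled to percent.
    """
    pred = np.asarray(pred_mask).astype(bool)
    true = np.asarray(true_mask).astype(bool)
    if pred.shape != true.shape:
        raise ValueError("masks must have the same shape")
    return 100.0 * np.logical_xor(pred, true).mean()


# Toy ground truth: a table occupying rows 1-2 across all four columns.
true_mask = np.zeros((4, 4), dtype=np.uint8)
true_mask[1:3, :] = 1

# Toy prediction: misses the first column and adds one stray pixel.
pred_mask = np.zeros((4, 4), dtype=np.uint8)
pred_mask[1:3, 1:] = 1
pred_mask[3, 3] = 1

# 3 of the 16 pixels disagree.
error = xor_error_percent(pred_mask, true_mask)
print(f"{error:.2f}%")
```

Under this reading, lower is better, and both missed table pixels and false table pixels count equally toward the error.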
Taxonomy
Topics: Handwritten Text Recognition Techniques
Methods: Sparse Evolutionary Training
