Synthetic Data Augmentation for Table Detection: Re-evaluating TableNet's Performance with Automatically Generated Document Images

Krishna Sahukara; Zineddine Bettouche; Andreas Fischer

arXiv:2506.14583·cs.CV·June 18, 2025

Open Access

TL;DR

This paper presents an automated LaTeX-based pipeline that generates synthetic document images containing tables, together with aligned ground-truth masks. The synthetic data is used to train and evaluate the TableNet table detection model while reducing manual annotation effort.

Contribution

The paper introduces a synthetic data generation method for table detection: realistic two-column document pages are created automatically, then used to augment the Marmot benchmark and to study how input resolution affects TableNet.

Findings

01

TableNet trained on the synthetic data achieves a pixel-wise XOR error of 4.04% at 256x256 input resolution and 4.33% at 1024x1024 on the synthetic test set.

02

On the Marmot benchmark, the best pixel-wise XOR error is 9.18%, obtained at 256x256 input resolution.

03

Because the pipeline produces the ground-truth masks automatically, it reduces manual annotation effort.

Abstract

Document pages captured by smartphones or scanners often contain tables, yet manual extraction is slow and error-prone. We introduce an automated LaTeX-based pipeline that synthesizes realistic two-column pages with visually diverse table layouts and aligned ground-truth masks. The generated corpus augments the real-world Marmot benchmark and enables a systematic resolution study of TableNet. Training TableNet on our synthetic data achieves a pixel-wise XOR error of 4.04% on our synthetic test set with a 256x256 input resolution, and 4.33% with 1024x1024. On the Marmot benchmark, the best pixel-wise XOR error is 9.18% (at 256x256). Because page and mask generation are automated, the pipeline also cuts manual annotation effort.
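The abstract reports results as a pixel-wise XOR error but this page does not define it. A minimal sketch, assuming the metric is the percentage of pixels at which the predicted binary table mask and the ground-truth mask disagree; the function name `xor_error` and the toy masks are hypothetical and not taken from the paper. Plain nested lists are used to keep the example dependency-free; in practice the masks would usually be image arrays.

```python
def xor_error(pred_mask, true_mask):
    """Percentage of pixels where two binary masks disagree.

    Assumption: this is one plausible reading of "pixel-wise XOR error";
    the paper's exact definition may differ.
    """
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same height")
    mismatched = 0
    total = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        if len(pred_row) != len(true_row):
            raise ValueError("masks must have the same width")
        for p, t in zip(pred_row, true_row):
            # XOR is true exactly when one mask marks the pixel and the other does not.
            mismatched += bool(p) ^ bool(t)
            total += 1
    if total == 0:
        raise ValueError("masks must not be empty")
    return 100.0 * mismatched / total


# Toy 4x4 example: 1 marks a table pixel, 0 marks background.
true_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(xor_error(pred_mask, true_mask))  # 2 of 16 pixels differ -> 12.5
```

Under this reading, a lower value is better, and both false positives (background predicted as table) and false negatives (table predicted as background) count equally.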

Taxonomy

Topics: Handwritten Text Recognition Techniques

Methods: Sparse Evolutionary Training