WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild
Junzhe Huang, Xiaoxiao Sun, Yan Yang, Yuxuan Hou, Ruotian Zhang, Sirui Li, Hehe Fan, Serena Yeung-Levy, Xin Yu

TL;DR
WildTableBench is a new benchmark for evaluating multimodal models on real-world table images, highlighting current limitations in structural perception and reasoning.
Contribution
Introduces WildTableBench, the first question-answering benchmark for naturally occurring table images: 402 images with 928 manually annotated and verified questions, plus an evaluation of 21 frontier models.
Findings
Only one model exceeds 50% accuracy on the benchmark.
All other models score between 4.1% and 49.9% accuracy.
Models show persistent weaknesses in structural perception and reasoning.
Abstract
Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table images underexplored. Such images feature varied layouts and diverse domains that demand sophisticated structural perception and numerical reasoning. To bridge this gap, we introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories. We evaluate 21 frontier proprietary and open-source…
