MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space
Anshul Singh, Chris Biemann, Jan Strich

TL;DR
MTabVQA introduces a new benchmark for evaluating vision-language models on complex multi-tabular visual reasoning tasks, highlighting current limitations and proposing fine-tuning methods to improve performance.
Contribution
The paper presents MTabVQA, a novel benchmark with a large dataset for multi-tabular visual question answering, and demonstrates how instruction-tuning enhances model reasoning capabilities.
Findings
State-of-the-art VLMs perform poorly on multi-tabular reasoning tasks.
Fine-tuning with MTabVQA-Instruct significantly improves reasoning performance.
The benchmark reveals substantial gaps in current models' multi-hop visual reasoning abilities.
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in interpreting visual layouts and text. However, a significant challenge remains in their ability to robustly interpret and reason over multi-tabular data presented as images, a common occurrence in real-world scenarios like web pages and digital documents. Existing benchmarks typically address single tables or non-visual data (text/structured). This leaves a critical gap: existing benchmarks do not assess the ability to parse diverse table images, correlate information across them, and perform multi-hop reasoning on the combined visual data. To bridge that gap, we introduce MTabVQA, a novel benchmark specifically designed for multi-tabular visual question answering. MTabVQA comprises 3,745 complex question-answer pairs that necessitate multi-hop reasoning across several visually rendered table images. We provide extensive…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
