MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space

Anshul Singh; Chris Biemann; Jan Strich

arXiv:2506.11684 · cs.CV · June 16, 2025

TL;DR

MTabVQA introduces a new benchmark for evaluating vision-language models on complex multi-tabular visual reasoning tasks, highlighting current limitations and proposing fine-tuning methods to improve performance.

Contribution

The paper presents MTabVQA, a benchmark of 3,745 question-answer pairs for multi-tabular visual question answering, and shows that instruction-tuning on MTabVQA-Instruct improves model reasoning.

Findings

01 State-of-the-art VLMs perform poorly on multi-tabular reasoning tasks.

02 Fine-tuning with MTabVQA-Instruct significantly improves reasoning performance.

03 The benchmark reveals substantial gaps in current models' multi-hop visual reasoning abilities.

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in interpreting visual layouts and text. However, a significant challenge remains in their ability to robustly interpret and reason over multi-tabular data presented as images, a common occurrence in real-world scenarios like web pages and digital documents. Existing benchmarks typically address single tables or non-visual data (text/structured). This leaves a critical gap: they do not assess the ability to parse diverse table images, correlate information across them, and perform multi-hop reasoning on the combined visual data. To bridge that gap, we introduce MTabVQA, a novel benchmark specifically designed for multi-tabular visual question answering. MTabVQA comprises 3,745 complex question-answer pairs that necessitate multi-hop reasoning across several visually rendered table images. We provide extensive…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques