TL;DR
INDOTABVQA is a new benchmark dataset for evaluating cross-lingual table understanding in Bahasa Indonesia documents, highlighting performance gaps in current models and the benefits of targeted fine-tuning.
Contribution
The paper introduces INDOTABVQA, a dataset for cross-lingual table VQA on Bahasa Indonesia documents, and benchmarks open-source VLMs and GPT-4o on it, quantifying both the performance gaps and the gains from fine-tuning.
Findings
Current VLMs show significant performance gaps on structurally complex tables and in low-resource languages.
Fine-tuning models on INDOTABVQA improves accuracy by up to 17.8%.
Adding explicit table region coordinates enhances model performance by 4-7%.
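One way to picture the table-coordinate finding is a prompt that lists each table's bounding box ahead of the question. The sketch below is only an illustration: the summary does not say how the authors encode coordinates, so the `(x_min, y_min, x_max, y_max)` pixel format, the prompt wording, and the `build_prompt` helper are all assumptions.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in image pixels.
Box = Tuple[int, int, int, int]


def build_prompt(question: str, table_boxes: List[Box]) -> str:
    """Prepend table region coordinates to a question (hypothetical layout)."""
    if not table_boxes:
        # No region hints available: fall back to the plain question.
        return f"Question: {question}"
    regions = "\n".join(
        f"Table {i}: x_min={x0}, y_min={y0}, x_max={x1}, y_max={y1}"
        for i, (x0, y0, x1, y1) in enumerate(table_boxes, start=1)
    )
    return f"Table regions (pixels):\n{regions}\nQuestion: {question}"


if __name__ == "__main__":
    # Illustrative values only, not taken from the dataset.
    print(build_prompt("Berapa total penjualan?", [(10, 20, 300, 400)]))
```

The resulting string would be sent to the VLM together with the document image; a page with several tables simply gets one region line per table.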
Abstract
We introduce INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. The dataset comprises 1,593 document images across three visual styles (bordered, borderless, and colorful), each containing one or more tables, and 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This enables evaluation of Vision-Language Models (VLMs) in both monolingual (Bahasa documents with Bahasa questions) and cross-lingual (Bahasa documents with questions in other languages) settings. We benchmark leading open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o and reveal substantial performance gaps, particularly on structurally complex tables and in low-resource languages. Fine-tuning a compact 3B model and a LoRA-fine-tuned 7B model on our dataset yields 11.6% and 17.8% improvements in…
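The monolingual versus cross-lingual split described in the abstract can be sketched as a per-setting accuracy breakdown. This is a minimal illustration under stated assumptions: the language tags (`id`, `en`, `hi`, `ar`), the record layout, and case-insensitive exact match as the scoring rule are all hypothetical, since the summary does not specify how answers are scored.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

DOC_LANG = "id"  # documents are in Bahasa Indonesia; the tag itself is assumed


def accuracy_by_setting(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, float]:
    """Compute accuracy per setting.

    Each record is (question_lang, predicted_answer, gold_answer).
    Scoring is case-insensitive exact match (an assumption).
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for question_lang, pred, gold in records:
        # Questions in the document language are monolingual; all others
        # are cross-lingual.
        setting = "monolingual" if question_lang == DOC_LANG else "cross-lingual"
        total[setting] += 1
        correct[setting] += int(pred.strip().lower() == gold.strip().lower())
    return {setting: correct[setting] / total[setting] for setting in total}


if __name__ == "__main__":
    # Toy records for illustration only.
    demo = [("id", "12", "12"), ("en", "a", "b"), ("hi", "X ", "x")]
    print(accuracy_by_setting(demo))
```

Grouping by setting rather than by individual language keeps the sketch short; keying the dictionaries on `question_lang` instead would give a per-language breakdown.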
