Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective
Xinmiao Yu, Xiaocheng Feng, Yun Li, Minghui Liao, Ya-Qi Yu, Xiachong Feng, Weihong Zhong, Ruihan Chen, Mengkang Hu, Jihao Wu, Dandan Tu, Duyu Tang, Bing Qin

TL;DR
This paper introduces XT-VQA, a benchmark for evaluating cross-lingual text-rich visual question answering, revealing performance gaps in LVLMs and proposing a mutual information-based method to improve cross-lingual visual understanding.
Contribution
The paper presents the XT-VQA benchmark for cross-lingual visual reasoning and proposes MVCL-MI, a mutual information maximization approach that enhances LVLMs' cross-lingual visual comprehension.
Findings
LVLMs perform poorly on cross-lingual text-rich visual tasks.
Mutual information analysis explains the performance gap.
MVCL-MI improves cross-lingual visual question answering results.
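For readers unfamiliar with the quantity behind these findings, the sketch below computes the textbook mutual information I(X;Y) = Σ p(x,y) log2[p(x,y) / (p(x)p(y))] from a small joint probability table. It is only an illustration of the general definition, not the paper's MVCL-MI objective or its analysis; the function name `mutual_information` and both toy distributions are hypothetical and chosen purely for the example.

```python
import math


def mutual_information(joint):
    """Return I(X;Y) in bits for a joint probability table joint[x][y]."""
    # Marginals: sum over rows for p(x), over columns for p(y).
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # zero-probability cells contribute nothing
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi


# Hypothetical toy distributions (not data from the paper).
independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells us nothing about Y
dependent = [[0.5, 0.0], [0.0, 0.5]]        # X determines Y exactly

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

In this toy setting the independent table yields zero mutual information, while the fully dependent one yields one bit, the most a binary variable can carry.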
Abstract
Recent Large Vision-Language Models (LVLMs) have shown promising reasoning capabilities on text-rich images from charts, tables, and documents. However, the abundant text within such images may increase the model's sensitivity to language. This raises the need to evaluate LVLM performance on cross-lingual text-rich visual inputs, where the language in the image differs from the language of the instructions. To address this, we introduce XT-VQA (Cross-Lingual Text-Rich Visual Question Answering), a benchmark designed to assess how LVLMs handle language inconsistency between image text and questions. XT-VQA integrates five existing text-rich VQA datasets and a newly collected dataset, XPaperQA, covering diverse scenarios that require faithful recognition and comprehension of visual information despite language inconsistency. Our evaluation of prominent LVLMs on XT-VQA reveals a…
Taxonomy
Topics: Visual and Cognitive Learning Processes · Advanced Text Analysis Techniques
