Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective

Xinmiao Yu; Xiaocheng Feng; Yun Li; Minghui Liao; Ya-Qi Yu; Xiachong Feng; Weihong Zhong; Ruihan Chen; Mengkang Hu; Jihao Wu; Dandan Tu; Duyu Tang; Bing Qin

arXiv:2412.17787 · cs.CV · December 24, 2024

TL;DR

This paper introduces XT-VQA, a benchmark for evaluating cross-lingual text-rich visual question answering, revealing performance gaps in LVLMs and proposing a mutual information-based method to improve cross-lingual visual understanding.

Contribution

The paper presents the XT-VQA benchmark for cross-lingual visual reasoning and proposes MVCL-MI, a mutual information maximization approach that enhances LVLMs' cross-lingual visual comprehension.

Findings

01 LVLMs perform poorly on cross-lingual text-rich visual tasks.

02 Mutual information analysis explains the performance gap.

03 MVCL-MI improves cross-lingual visual question answering results.
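
The second finding rests on mutual information, I(X;Y) = sum over x,y of p(x,y) log2[ p(x,y) / (p(x) p(y)) ], which measures how much knowing one variable reduces uncertainty about another. The following is a minimal sketch of that textbook definition on a toy discrete joint distribution. It is not the paper's estimator or the MVCL-MI objective, which this summary does not spell out, and the function name and example tables are hypothetical.

```python
import math

def mutual_information(joint):
    """Return I(X;Y) in bits for a joint probability table joint[x][y].

    Illustrative helper only (hypothetical name); assumes the entries
    are non-negative and sum to 1.
    """
    # Marginals: p(x) from row sums, p(y) from column sums.
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:  # zero-probability cells contribute nothing
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

# Toy distributions (assumed for illustration, not data from the paper).
independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells us nothing about Y
dependent = [[0.5, 0.0], [0.0, 0.5]]        # X fully determines Y

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

In this toy setting the independent table carries no shared information while the fully dependent one carries one bit. Read loosely, a model whose output shares less information with the visual text when the question language differs would be expected to answer less faithfully, which is the kind of gap the finding refers to.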

Abstract

Recent Large Vision-Language Models (LVLMs) have shown promising reasoning capabilities on text-rich images from charts, tables, and documents. However, the abundant text within such images may increase the model's sensitivity to language. This raises the need to evaluate LVLM performance on cross-lingual text-rich visual inputs, where the language in the image differs from the language of the instructions. To address this, we introduce XT-VQA (Cross-Lingual Text-Rich Visual Question Answering), a benchmark designed to assess how LVLMs handle language inconsistency between image text and questions. XT-VQA integrates five existing text-rich VQA datasets and a newly collected dataset, XPaperQA, covering diverse scenarios that require faithful recognition and comprehension of visual information despite language inconsistency. Our evaluation of prominent LVLMs on XT-VQA reveals a…

Taxonomy

Topics: Visual and Cognitive Learning Processes · Advanced Text Analysis Techniques