Can MLLMs "Read" What is Missing?
Jindi Guo, Chaozheng Huang, and Xi Fang

TL;DR
This paper introduces MMTR-Bench, a new benchmark for evaluating Multimodal Large Language Models' ability to reconstruct masked text from visual context, focusing on layout understanding and grounding.
Contribution
The paper presents MMTR-Bench, a dataset of 2,771 test samples with a level-aware evaluation protocol, for assessing MLLMs' intrinsic ability to reconstruct masked text from visual inputs.
Findings
MLLMs find sentence- and paragraph-level reconstruction particularly challenging.
The benchmark covers multiple languages and target lengths, revealing limitations in current models.
A level-aware evaluation protocol accounts for the diversity of languages and target lengths across test samples.
Abstract
We introduce MMTR-Bench, a benchmark designed to evaluate the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional question-answering tasks, MMTR-Bench eliminates explicit prompts, requiring models to recover masked text from single- or multi-page inputs across real-world domains such as documents and webpages. This design isolates the reconstruction task from instruction-following abilities, enabling a direct assessment of a model's layout understanding, visual grounding, and knowledge integration. MMTR-Bench comprises 2,771 test samples spanning multiple languages and varying target lengths. To account for this diversity, we propose a level-aware evaluation protocol. Experiments on representative MLLMs show that the benchmark poses a significant challenge, especially for sentence- and paragraph-level…
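The abstract mentions a level-aware evaluation protocol but this page does not describe how it works. As a rough illustration of the general idea only, the sketch below scores predictions separately per target-length level and then averages across levels, so that short and long targets are not pooled into one number. Everything in it is an assumption made for illustration: the level names, the `(level, prediction, target)` sample layout, the character-level similarity metric from Python's `difflib`, and the unweighted averaging are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(prediction: str, target: str) -> float:
    """Character-level similarity in [0, 1].

    Illustrative metric only; the paper's actual metric is not given here.
    """
    return SequenceMatcher(None, prediction.strip(), target.strip()).ratio()


def level_aware_scores(samples):
    """Score predictions per level, then average across levels.

    samples: iterable of (level, prediction, target) tuples, where `level`
    is a hypothetical label such as "word", "sentence", or "paragraph".
    Returns (per_level, macro): the mean similarity for each level and the
    unweighted mean of those per-level means.
    """
    by_level = defaultdict(list)
    for level, prediction, target in samples:
        by_level[level].append(similarity(prediction, target))

    per_level = {level: sum(s) / len(s) for level, s in by_level.items()}
    macro = sum(per_level.values()) / len(per_level) if per_level else 0.0
    return per_level, macro


if __name__ == "__main__":
    demo = [
        ("word", "layout", "layout"),
        ("sentence", "abc", "xyz"),
    ]
    per_level, macro = level_aware_scores(demo)
    print(per_level, macro)
```

Averaging per level first means that a level with few samples counts as much as a level with many, which is one way to keep long-target performance from being hidden by easier short targets.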
