Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models

Zhentao He; Can Zhang; Ziheng Wu; Zhenghao Chen; Yufei Zhan; Yifan Li; Zhao Zhang; Xian Wang; Minghui Qiu

arXiv:2506.20168 · cs.CV · September 23, 2025

TL;DR

This paper introduces a new benchmark and a framework to evaluate and reduce OCR hallucinations in multimodal large language models, especially under degraded visual conditions, improving document understanding accuracy.

Contribution

The paper presents KIE-HVQA, the first benchmark for OCR hallucination evaluation, and a GRPO-based framework with a reward mechanism to mitigate hallucinations in degraded document understanding.
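As a rough illustration of the GRPO idea mentioned above, the sketch below scores a group of sampled answers and normalizes each reward against the group's mean and standard deviation. The reward function `toy_reward`, the `[unreadable]` abstain token, and the choice of population standard deviation are assumptions made for illustration only; the paper's actual reward design is not described in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style advantage).

    Each reward is centered on the group mean and scaled by the group
    standard deviation; eps avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_reward(answer, visible_text, abstain_token="[unreadable]"):
    """Hypothetical reward, not the paper's.

    visible_text is None when the field is too degraded to read. In that
    case abstaining is rewarded and any guess is penalized. Otherwise the
    answer earns credit only if it matches the visible text exactly.
    """
    if visible_text is None:
        return 1.0 if answer == abstain_token else -1.0
    return 1.0 if answer == visible_text else 0.0


# Four sampled answers for a field that is fully occluded in the image.
samples = ["2023-01-05", "[unreadable]", "2023-01-06", "[unreadable]"]
rewards = [toy_reward(s, visible_text=None) for s in samples]
advantages = group_relative_advantages(rewards)

for sample, adv in zip(samples, advantages):
    print(f"{sample!r}: advantage {adv:+.2f}")
```

Under this toy setup, the abstaining answers receive positive advantages and the guessed dates receive negative ones, so a policy update would push the model toward acknowledging unreadable regions instead of filling them in from linguistic priors.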

Findings

01. 22% improvement in hallucination-free accuracy on KIE-HVQA

02. No significant performance drop on standard tasks

03. Effective mitigation of hallucinations in ambiguous regions

Abstract

Recent advancements in multimodal large language models have enhanced document understanding by integrating textual and visual information. However, existing models exhibit incompleteness within their paradigm in real-world scenarios, particularly under visual degradation. In such conditions, the current response paradigm often fails to adequately perceive visual degradation and ambiguity, leading to overreliance on linguistic priors or misaligned visual-textual reasoning. This difficulty in recognizing uncertainty frequently results in the generation of hallucinatory content, especially when a precise answer is not feasible. To better demonstrate and analyze this phenomenon and problem, we propose KIE-HVQA, the first benchmark dedicated to evaluating OCR hallucination in degraded document understanding. This dataset includes test samples spanning identity cards and invoices, with…

Code & Models

Datasets

bytedance-research/KIE-HVQA (dataset, 134 downloads)

Taxonomy

Topics: Psychosomatic Disorders and Their Treatments · Clinical Reasoning and Diagnostic Skills