When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding

Yan Shu; Hangui Lin; Yexin Liu; Yan Zhang; Gangyan Zeng; Yan Li; Yu Zhou; Ser-Nam Lim; Harry Yang; Nicu Sebe

arXiv:2506.05551·cs.CV·October 8, 2025

TL;DR

This paper identifies causes of semantic hallucinations in large multimodal models during scene text understanding and proposes a training-free framework with a new benchmark to mitigate these hallucinations effectively.

Contribution

The work introduces a novel, training-free mitigation framework and a comprehensive benchmark to address semantic hallucinations in large multimodal models for scene text tasks.

Findings

01 Transformer layers in the LLM whose attention focuses more strongly on scene text regions are less prone to semantic hallucinations

02 The proposed method effectively mitigates semantic hallucinations

03 Strong performance on scene text benchmarks

Abstract

Large Multimodal Models (LMMs) have achieved impressive progress in visual perception and reasoning. However, when confronted with visually ambiguous or non-semantic scene text, they often struggle to accurately spot and understand the content, frequently generating semantically plausible yet visually incorrect answers, which we refer to as semantic hallucination. In this work, we investigate the underlying causes of semantic hallucination and identify a key finding: Transformer layers in the LLM with stronger attention focus on scene text regions are less prone to producing semantic hallucinations. Thus, we propose a training-free semantic hallucination mitigation framework comprising two key components: (1) ZoomText, a coarse-to-fine strategy that identifies potential text regions without external detectors; and (2) Grounded Layer Correction, which adaptively leverages the internal…
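
The key finding above, that layers attending more strongly to scene text regions hallucinate less, can be illustrated with a small sketch. This is not the authors' implementation: the scoring rule (share of a layer's image-token attention that falls on text-region tokens), the function names `text_focus_score` and `most_text_focused_layer`, and the toy attention values are all assumptions made for illustration only.

```python
from typing import List, Sequence


def text_focus_score(attn_over_image_tokens: Sequence[float],
                     text_mask: Sequence[bool]) -> float:
    """Fraction of one layer's image-token attention mass that lands on
    tokens marked as belonging to a scene text region (hypothetical metric)."""
    if len(attn_over_image_tokens) != len(text_mask):
        raise ValueError("attention vector and mask must have the same length")
    total = sum(attn_over_image_tokens)
    if total <= 0:
        return 0.0
    on_text = sum(a for a, is_text in zip(attn_over_image_tokens, text_mask) if is_text)
    return on_text / total


def most_text_focused_layer(per_layer_attn: Sequence[Sequence[float]],
                            text_mask: Sequence[bool]) -> int:
    """Index of the layer with the highest text focus score."""
    scores: List[float] = [text_focus_score(layer, text_mask) for layer in per_layer_attn]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example: three layers, four image tokens, tokens 1 and 2 cover the text.
toy_attn = [
    [0.25, 0.25, 0.25, 0.25],  # diffuse attention
    [0.10, 0.60, 0.20, 0.10],  # concentrated on the text tokens
    [0.40, 0.10, 0.10, 0.40],  # concentrated away from the text
]
toy_mask = [False, True, True, False]
best_layer = most_text_focused_layer(toy_attn, toy_mask)
```

In this toy setup the second layer would be the one selected, since most of its attention mass sits on the masked text tokens. How the actual framework obtains text regions and uses the selected layers is described in the paper itself and is not reproduced here.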

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis