Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation
Yasser Hamidullah, Koel Dutta Chowdhury, Yusser Al Ghussin, Shakib Yazdani, Cennet Oguz, Josef van Genabith, Cristina España-Bonet

TL;DR
This paper introduces a reliability measure based on visual grounding signals to detect hallucinations in sign language translation models, improving robustness and interpretability.
Contribution
It proposes a novel token-level reliability measure combining feature sensitivity and counterfactual signals to identify hallucinations in SLT models.
Findings
Reliability predicts hallucination rates across datasets and models.
Reliability decreases with visual degradation, indicating sensitivity to visual grounding.
Combining reliability with text signals enhances hallucination risk estimation.
Abstract
Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into…
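The two signals named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' formulation: the function names, the use of cosine distance between decoder states for feature sensitivity, the log-probability difference for the counterfactual signal, the sigmoid squashing, and the mixing weight `alpha` are all hypothetical (the abstract here is truncated before it describes how the signals are aggregated).

```python
import numpy as np


def counterfactual_signal(logp_clean, logp_masked):
    """Per-token log-probability drop when the video input is altered.

    Assumption: inputs are the log-probabilities the decoder assigns to the
    same output tokens with clean and with masked video, each of shape (T,).
    """
    return np.asarray(logp_clean, dtype=float) - np.asarray(logp_masked, dtype=float)


def feature_sensitivity(h_clean, h_masked, eps=1e-8):
    """Per-token cosine distance between decoder states, each of shape (T, D).

    Assumption: "internal changes when video is masked" is approximated here
    by how far each token's hidden state moves under masking.
    """
    h_clean = np.asarray(h_clean, dtype=float)
    h_masked = np.asarray(h_masked, dtype=float)
    num = np.sum(h_clean * h_masked, axis=-1)
    den = np.linalg.norm(h_clean, axis=-1) * np.linalg.norm(h_masked, axis=-1) + eps
    return 1.0 - num / den


def token_reliability(logp_clean, logp_masked, h_clean, h_masked, alpha=0.5):
    """Hypothetical aggregation: weighted mean of two scores in [0, 1]."""
    cf = counterfactual_signal(logp_clean, logp_masked)
    fs = feature_sensitivity(h_clean, h_masked)
    cf_score = 1.0 / (1.0 + np.exp(-cf))       # sigmoid maps the drop to (0, 1)
    fs_score = np.clip(fs / 2.0, 0.0, 1.0)     # cosine distance lies in [0, 2]
    return alpha * fs_score + (1.0 - alpha) * cf_score


# Toy example with two tokens: the first ignores the video, the second depends on it.
logp_clean = [-1.0, -0.5]
logp_masked = [-1.0, -5.5]
h_clean = [[1.0, 0.0], [1.0, 0.0]]
h_masked = [[1.0, 0.0], [0.0, 1.0]]
reliability = token_reliability(logp_clean, logp_masked, h_clean, h_masked)
```

In this toy setup, a token whose probability and hidden state are unchanged by masking gets a low score (the model is presumably relying on language priors), while a token that changes sharply gets a high score. A real system would need a principled choice of masking strategy and weights, which this sketch does not attempt.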
Peer Reviews
Decision: ICLR 2026 Poster
- Precisely identifies hallucination in SLT, especially in gloss-free settings, as a critical, under-addressed issue tied to visual grounding weakness.
- The token-level reliability score is architecture-independent, requiring no reference translations, making it broadly applicable across existing and future SLT systems.
- Combines encoder sensitivity (input robustness) and counterfactual masking (internal consistency) into a well-justified, interpretable proxy for visual grounding.
- Establis
- Lacks testing on larger, more diverse, or continuous SLT benchmarks (e.g., How2Sign, OpenASL).
- Tested on a small set of architectures (primarily transformer-based).
- Unclear whether the proposed reliability score outperforms or complements prior art.
- S1. This manuscript is well-motivated, addressing the crucial issue of evaluating and improving the utilization of visual information in sign language translation (SLT), a topic that has garnered significant attention in recent years. The manuscript aims to design an indicator to shed light on this challenge.
- S2. The manuscript provides a comprehensive set of indicators for assessing the utilization of visual information, considering both the feature space and the output space.
- S3. The man
- W1. The definition of hallucination in SLT is unclear. While hallucination in visual-language models (VLMs) is introduced due to the ambiguity in input range and output space, SLT has clearly defined output results and visual-language correspondences. The errors in hallucination can be directly measured using metrics like word error rate (WER) or similar set prediction metrics. Therefore, the introduction of the hallucination concept in SLT seems unnecessary.
- W2. As a translation task, many
It is the first work to investigate hallucination in sign language. The authors provide a novel metric to measure the reliability for SLT based on the visual input and text output. They also demonstrate the proposed reliability surpasses text-only baselines in detecting hallucinations.
1. The authors did not provide a clear analysis of the relationships among the three key concepts, i.e., sign language translation performance, the rate of hallucination, and the proposed metric.
2. The key steps of the proposed metric are missing, such as how to calculate the weights. The appendix is also incomplete, which is confusing. For specific issues, please refer to the descriptions in the Question section.
3. The authors claim that the hallucinations significantly impact the performance
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems
