TL;DR
ReCo is a lightweight, trainable module added to vision-language models that effectively reduces hallucinations by mitigating the fading memory of visual input, improving performance across multiple benchmarks.
Contribution
This paper introduces ReCo, a novel, trainable module that enhances vision-language models by reducing hallucinations without requiring major model modifications.
Findings
ReCo reduces hallucinations in VLMs across multiple benchmarks.
ReCo improves performance when combined with other hallucination mitigation methods.
ReCo is compatible with various VLM architectures.
Abstract
Vision Language Models (VLMs) show impressive capabilities in integrating and reasoning with both visual and language data. But these models make mistakes. A common finding -- similar to LLMs -- is their tendency to hallucinate, i.e., generate plausible sounding text which is not grounded in the visual input, or at worst, is contradictory. A growing consensus attributes this behavior to an over-reliance on language -- especially as the generation progresses, the model suffers from a “fading memory effect” with respect to the provided visual input. We study mechanisms by which this behavior can be controlled. Specifically, using ideas from geometric algebra and relational compositions, we propose the addition of a small, trainable module (named ReCo) on top of any VLM -- no other modification is needed. We show that such a lightweight module is able to mitigate the fading memory effect…
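To make the placement of such a module concrete, here is a minimal sketch of a "reminder" layer applied to the final hidden state just before the prediction head. It is an illustration under stated assumptions, not the paper's method: the names `W_T` and `W_I` and the idea of pooling image tokens are borrowed from the review excerpts on this page, while the class `ReminderSketch`, the helpers, the mean pooling, the initialization, and above all the plain additive combination are hypothetical stand-ins for the geometric-algebra composition the abstract refers to.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mean_pool(tokens):
    """Average a list of token vectors into one vector (assumed pooling choice)."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


class ReminderSketch:
    """Hypothetical ReCo-style layer; the additive operator is an assumption."""

    def __init__(self, dim):
        # W_T starts as the identity and W_I as zeros, so the untrained
        # layer leaves the base model's hidden state unchanged.
        self.W_T = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.W_I = [[0.0] * dim for _ in range(dim)]

    def __call__(self, hidden, image_tokens):
        text_part = matvec(self.W_T, hidden)
        image_part = matvec(self.W_I, mean_pool(image_tokens))
        # Stand-in composition: re-inject the pooled image signal by addition.
        return [t + v for t, v in zip(text_part, image_part)]


def decode_step(hidden, image_tokens, reminder, lm_head):
    """One decoding step: the reminder call is the only step added to the base model."""
    hidden = reminder(hidden, image_tokens)
    return matvec(lm_head, hidden)  # logits over a toy vocabulary
```

In this sketch only `W_T` and `W_I` would be trained while the base model stays frozen. The same image summary is re-injected at every decoding step, which is one simple way to counteract a visual signal that fades as generation proceeds.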
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
**S1. Efficient.** ReCo requires few parameters and operates on the final hidden representation, resulting in low computational overhead. **S2. Easy to Implement.** The method is straightforward to integrate, requiring only two lines of code. **S3. Effective Across Diverse Hallucination Benchmarks.** The proposed approach is effective across multiple benchmark datasets.
**W1. Ambiguity of the black box.** Generally, we cannot access the last hidden layer in black-box models, such as GPT and Claude. With these black-box models, we can obtain only the generated results, namely the text. Thus, the proposed method cannot be used for black box models. **W2. Comparison with training methods.** Hallucination mitigation methods can be categorized into training-based and training-free approaches. The proposed method falls within the training-based approach. The paper
1. The paper provides a clear and intuitive explanation of the fading memory effect, supported by a well-designed visualization in Figure 2. This helps readers quickly grasp the core problem that ReCo aims to solve. 2. The introduction and theoretical background sections are detailed and logically structured, giving readers a solid understanding of the motivation behind Reminder Composition. 3. Experiments are conducted across five diverse benchmarks, demonstrating the general effectiveness
1. In Figure 2, it is unclear which model the attention maps are derived from, and what the corresponding input data and generated tokens are. Clarifying these details would help readers better understand the relationship between visual attention and generated content. 2. While the introduction is well-written, it feels somewhat verbose. The authors might consider streamlining it and improving the logical transitions between paragraphs—for example, the sudden shift to the “Compositionality and
- The paper provides an interpretation of the fading memory effect in VLMs through the lens of Geometric Algebra, offering a clear and conceptually motivated formulation of the proposed ReCo module. - The proposed method demonstrates consistent improvements across evaluated VLMs, supported by both quantitative and qualitative experiments that validate its effectiveness in reducing hallucinations and enhancing visual grounding.
- The proposed method appears conceptually similar to Li et al. (ICML 2025), “The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering.” A more explicit discussion of the similarities and differences between ReCo and this prior work would clarify the paper’s unique contribution. - Continuously re-injecting image features at every decoding step may risk making the visual signal overly dominant, potentially reducing the language model’s cont
- ReCo adds a tiny trainable layer before the prediction head, requiring no changes to the base VLM, minimal training, and negligible deployment overhead. - Demonstrates robust improvements across multiple VLMs (InstructBLIP, LLaVA, MiniGPT-4) and benchmarks (e.g., CHAIR, POPE, AMBER, HallusionBench), indicating broad applicability. - Stacks cleanly with prior decoding/mitigation techniques and yields further gains, supporting practical integration into real systems.
- The images are too —they’re not vector graphics. - The citation format also looks incorrect. - Ablations don’t isolate the “composition” effect: gains from W_T vs. W_I, image-token pooling choices, and alternative operators aren’t disentangled. - Most of the benchmarks used in the paper are discriminative. Consider adding generative benchmarks as well, for example, FaithScore: Fine-grained Evaluations of Hallucinations in Large Vision-Language Models. - Motivation is unclear: the pap
