PruneHal: Reducing Hallucinations in Multi-modal Large Language Models through Adaptive KV Cache Pruning
Fengyuan Sun, Hui Chen, Xinhao Xu, Dandan Zheng, Jingdong Chen, Jun Zhou, Jungong Han, Guiguang Ding

TL;DR
PruneHal is a training-free, adaptive KV cache pruning method that reduces hallucinations in multi-modal large language models by focusing attention on critical visual tokens, with minimal additional computational cost.
Contribution
This paper introduces PruneHal, the first token pruning approach for hallucination mitigation in MLLMs that requires no extra training and is compatible with various decoding strategies.
Findings
Significantly reduces hallucinations across multiple benchmarks
Achieves robust performance with minimal inference overhead
Compatible with different MLLMs and decoding methods
Abstract
While multi-modal large language models (MLLMs) have made significant progress in recent years, the issue of hallucinations remains a major challenge. To mitigate this phenomenon, existing solutions either introduce additional data for further training or incorporate external or internal information during inference. However, these approaches inevitably introduce extra computational costs. In this paper, we observe that hallucinations in MLLMs are strongly associated with insufficient attention allocated to visual tokens. In particular, the presence of redundant visual tokens disperses the model's attention, preventing it from focusing on the most informative ones. As a result, critical visual cues are often under-attended, which in turn exacerbates the occurrence of hallucinations. Building on this observation, we propose PruneHal, a training-free, simple yet effective method…
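The attention-dispersion argument in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a single attention head at one decoding step, treats `logits` as hypothetical query-key scores over the cached tokens, and uses a fixed `keep` budget chosen purely for illustration. Dropping the lowest-scoring visual entries shrinks the softmax denominator, so the retained visual token receives a larger share of attention.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prune_visual_tokens(logits, is_visual, keep):
    """Return indices of cache entries kept after pruning.

    All text tokens are kept; only the `keep` visual tokens with the
    highest attention weight survive (hypothetical selection rule).
    """
    weights = softmax(logits)
    visual = [i for i, flag in enumerate(is_visual) if flag]
    top = set(sorted(visual, key=lambda i: weights[i], reverse=True)[:keep])
    return [i for i, flag in enumerate(is_visual) if not flag or i in top]


# Toy cache: five visual tokens (one informative, four redundant), two text tokens.
logits = [2.0, 0.1, 0.0, 0.2, 0.1, 1.0, 1.2]
is_visual = [True, True, True, True, True, False, False]

kept = prune_visual_tokens(logits, is_visual, keep=2)  # kept == [0, 3, 5, 6]

# Attention on the informative visual token (index 0) before and after pruning.
before = softmax(logits)[0]
after = softmax([logits[i] for i in kept])[0]
```

In this toy setting `after` is larger than `before` because the same score is normalized over fewer competitors. In a real model the pruned entries would be key/value tensors per layer and head rather than scalar scores.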
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. PruneHal is a training-free solution, making it accessible for deployment without additional training costs.
2. PruneHal introduces minimal computational overhead compared to existing methods that require additional inference steps or training.
3. The paper includes informative visual attention plots and latency analysis (Figures 3 and 6), which effectively demonstrate how pruning enhances model attention on critical visual tokens while mitigating hallucinations.
4. The paper presents extensi
1. The paper provides an overview of how adaptive pruning works, but does not fully explain the underlying mechanics that govern the pruning decisions.
2. The method is evaluated on a variety of MLLMs, but there is little discussion of how **PruneHal** would perform on SOTA models like Qwen2.5-VL, Qwen3-VL, or InternVL3.5.
3. A more detailed analysis of failure cases or ablation studies on extreme pruning scenarios would be valuable.
4. Some related and important works are not included in the d
- The method is training-free and highly efficient. As shown in Figure 6, it adds negligible overhead and can even accelerate inference during beam search by reducing the KV cache size.
- The idea of using adaptive pruning as a hallucination mitigation technique is novel. The layer-wise voting mechanism to decide when to prune is a smart approach to balance information loss and attention focusing.
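The layer-wise vote mentioned in this review can be sketched as a simple majority rule. This is an illustrative assumption, not the paper's Algorithm 1: the function names, the toy attention rows, and the `threshold` value are hypothetical, and the paper's exact criterion is not reproduced here.

```python
def visual_attention_share(weights, is_visual):
    """Total attention mass that one layer places on visual tokens."""
    return sum(w for w, flag in zip(weights, is_visual) if flag)


def should_prune(per_layer_weights, is_visual, threshold):
    """Trigger pruning when a strict majority of layers fall below `threshold`.

    `per_layer_weights` holds one normalized attention row per layer
    (hypothetical layout, e.g. already averaged over heads).
    """
    votes = sum(
        visual_attention_share(weights, is_visual) < threshold
        for weights in per_layer_weights
    )
    return votes > len(per_layer_weights) / 2


# Toy example: two visual tokens followed by two text tokens, three layers.
is_visual = [True, True, False, False]
layers = [
    [0.05, 0.05, 0.50, 0.40],  # visual share about 0.10
    [0.10, 0.05, 0.45, 0.40],  # visual share about 0.15
    [0.30, 0.30, 0.20, 0.20],  # visual share about 0.60
]

trigger = should_prune(layers, is_visual, threshold=0.2)      # two of three layers vote to prune
no_trigger = should_prune(layers, is_visual, threshold=0.12)  # only one layer votes to prune
```

Voting across layers, rather than reacting to a single layer, makes the trigger less sensitive to one layer's attention pattern, which matches the reviewer's point about balancing information loss against attention focusing.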
- The paper's most critical weakness is its admission that it cannot be evaluated on benchmarks like POPE because "models' responses will keep unchanged". This implies the pruning mechanism only activates after the first token is generated. This is a fundamental design flaw. A hallucination mitigation strategy that cannot influence the first generated token is of very limited use, as it cannot correct a model that is already on a hallucinatory path from the very first token (e.g., answering "Yes
* This paper identifies a noteworthy phenomenon: the allocation of attention influences the generation of hallucinations.
* Building on this insight, PruneHal proposes a pruning-based approach to remove some redundant information, thereby reducing hallucinations. It requires no additional training and operates at a relatively fast speed.
* In the experiments, the paper not only tests the effectiveness of its own scheme but also verifies that combining this scheme with other existing schemes can
* The models tested in the paper are somewhat outdated. For instance, Qwen-VL, which is tested in the paper, has now been updated to Qwen3-VL, and LLaVA-v1.5 has also been updated to LLaVA-NeXT. It is suggested that more up-to-date models should be used to test the performance of PruneHal.
* Why does Algorithm 1 not explain why pruning should also be triggered when m = 2? Is it hypothesized that the authors intend to intervene in visual attention at an early stage to prevent redundant visual tok
- The method is practical: no retraining, no architecture change, plug-in compatibility with existing MLLMs.
- Good empirical coverage across multiple models.
- The topic is significant: hallucination in MLLMs is a major deployment hurdle.
- Novelty is limited: prior works have identified the same underlying problem (neglect of visual tokens) and proposed decoding-time mitigations, such as OPERA.
- Narrow experimental scope: the experiments rely mainly on the CHAIR-based benchmark. Other hallucination types (e.g., relation/attribute hallucination, VQA, multi-image or video inputs) are not explored, even though many benchmarks are readily available, such as POPE, HallusionBench, and MMHal-Bench. This limits claims of generality.
- Interpretability/anal
## 1. Training-free, plug-and-play, and compatible with other methods
The method requires no retraining, slots into standard decoding, and is explicitly positioned as compatible with hallucination-aware decoders. The conclusion also emphasizes “virtually no computational overhead.”
## 2. Adaptive, principled trigger that tracks attention drift
The design of this paper reacts only when a majority of layers’ visual-token attention falls below a threshold (layer-vote with a ($\sqrt{r}$) cr
## 1. **Correlation, not causation, in the empirical motivation**
The key quantitative probe links lower visual attention to hallucination using scatter plots and description, with details deferred to the appendix. This experiment cannot be considered a full causal study, because 1) it only shows the performance of llava-v1.5-7b; 2) since visual-token attention tends to decay over later decoding steps, if hallucinations also occur later, lower attention could partially reflect step in
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
