Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding

Jinlin Li; Yuran Wang; Yifei Yuan; Xiao Zhou; Yingying Zhang; Xixian Yong; Yefeng Zheng; Xian Wu

arXiv:2510.18321·cs.CV·October 22, 2025

Open Access·3 Reviews

TL;DR

This paper introduces ATED, a training-free, adaptive token ensemble decoding method that reduces hallucinations in large vision-language models by dynamically aggregating multiple model predictions during inference.

Contribution

The paper presents ATED, a novel ensemble decoding framework that mitigates hallucinations without additional training, improving robustness of LVLMs in multimodal tasks.

Findings

01

ATED significantly reduces hallucinations compared to state-of-the-art methods.

02

The approach maintains fluency and relevance in generated descriptions.

03

Adaptive weighting improves model reliability at each decoding step.

Abstract

Large Vision-Language Models (LVLMs) have recently achieved impressive results in multimodal tasks such as image captioning and visual question answering. However, they remain prone to object hallucination -- generating descriptions of nonexistent or misidentified objects. Prior work has partially mitigated this via auxiliary training objectives or external modules, but challenges remain in terms of scalability, adaptability, and model independence. To address these limitations, we propose Adaptive Token Ensemble Decoding (ATED), a training-free, token-level ensemble framework that mitigates hallucination by aggregating predictions from multiple LVLMs during inference. ATED dynamically computes uncertainty-based weights for each model, reflecting their reliability at each decoding step. It also integrates diverse decoding paths to improve contextual grounding and semantic consistency.…
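The per-step weighting described in the abstract can be illustrated with a small sketch. It is not the authors' formulation: it assumes that uncertainty is measured as the Shannon entropy of each model's next-token distribution, that weights come from a softmax over negative entropies, and that all models share one vocabulary. The names `entropy`, `uncertainty_weights`, `ensemble_step`, and the `temperature` parameter are hypothetical, and the "diverse decoding paths" component is not modeled.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty_weights(distributions, temperature=1.0):
    """Assumed weighting rule: softmax over negative entropies,
    so a lower-entropy (more confident) model gets a larger weight."""
    scores = [-entropy(d) / temperature for d in distributions]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_step(distributions):
    """One decoding step: mix the per-model distributions with the
    uncertainty weights, then pick the highest-probability token."""
    weights = uncertainty_weights(distributions)
    vocab_size = len(distributions[0])
    mixed = [
        sum(w * d[i] for w, d in zip(weights, distributions))
        for i in range(vocab_size)
    ]
    token = max(range(vocab_size), key=mixed.__getitem__)
    return token, weights, mixed


# Toy three-token vocabulary: one confident model, one uncertain model.
confident = [0.90, 0.05, 0.05]
uncertain = [0.30, 0.40, 0.30]
token, weights, mixed = ensemble_step([confident, uncertain])
print(token, [round(w, 3) for w in weights])
```

In this toy case the confident model receives the larger weight, so its preferred token wins the step even though the uncertain model ranks a different token first. Real LVLMs with different tokenizers would need a vocabulary alignment step that this sketch omits.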

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

1. The paper introduces a clear and well-motivated idea: ensemble decoding at the token level across multiple LVLMs guided by adaptive uncertainty. This fine-grained approach extends ensemble learning into multimodal generation, which is both innovative and practically relevant. 2. ATED does not require additional training, making it broadly applicable across existing LVLMs and compatible with open-source backbones like LLaVA, InstructBLIP, and MiniGPT-4. 3. The paper includes comparisons on mul

Weaknesses

1. The proposed ATED framework requires simultaneous inference across multiple LVLMs, which substantially increases GPU memory usage and deployment cost. Figure 4 also shows that inference latency can increase up to six times compared to standard decoding. In contrast, other training-free approaches such as VCD typically introduce at most a twofold increase in latency. This raises concerns about ATED’s scalability and practicality in real-world applications where efficiency is critical. 2. The

Reviewer 02·Rating 6·Confidence 4

Strengths

(1) Training-free, plug-and-play method that leverages existing LVLMs without retraining; works across several backbones. (2) Consistent empirical gains on POPE / CHAIR / MME over strong decoding baselines (VCD, ICD, SID). (3) Ablations + latency knob make the method well-diagnosed and practically tunable.

Weaknesses

(1) The paper compares a multi-model ensemble to single-model baselines; real-world feasibility of running 2–3 LVLMs + perturbations per token is unclear. (2) The paper also lacks comparison to 2025 ED / FastED / iTaD / IFCD-style plug-and-play hallucination mitigators, weakening the “significantly outperforms SOTA” claim.

Reviewer 03·Rating 4·Confidence 4

Strengths

1. The topic is interesting and tries to address an important problem. 2. The paper writing is easy to follow.

Weaknesses

1. Beyond entropy, can uncertainty be measured with alternative metrics? 2. I’m unclear on the exact decoding procedure. After computing model-specific uncertainty weights, which model (or aggregation) actually drives decoding? Does this operate token-by-token only, or can it decode full sentences? If full sentences, must outputs from all models be forced to match exactly? 3. Please provide efficiency measurements for the entire procedure. 4. I would like deeper analysis—for example, showing

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Misinformation and Its Impacts