TL;DR
This paper introduces Ensemble Decoding, a novel method that reduces object hallucination in vision-language models by splitting images, combining logits with attention-guided weights, and calibrating plausibility, achieving state-of-the-art results.
Contribution
The paper presents a new ensemble decoding strategy with adaptive plausibility constraints and a fast variant, improving hallucination mitigation in vision-language models.
Findings
Achieves state-of-the-art performance on hallucination benchmarks.
Effectively reduces object hallucination in image captioning and VQA.
Provides a scalable, training-free approach with speed-optimized variant.
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have significantly expanded their utility in tasks like image captioning and visual question answering. However, they still struggle with object hallucination, where models generate descriptions that inaccurately reflect the visual content by including nonexistent objects or misrepresenting existing ones. While previous methods, such as data augmentation and training-free approaches, strive to tackle this issue, they still encounter scalability challenges and often depend on additional external modules. In this work, we propose Ensemble Decoding (ED), a novel strategy that splits the input image into sub-images and combines logit distributions by assigning weights through the attention map. Furthermore, we introduce ED adaptive plausibility constraint to calibrate logit distribution and FastED, a variant designed for…
Peer Reviews
Decision·ICLR 2025 Poster
1. The ensemble decoding strategy proposed in this work is well-motivated and sound by zooming into local regions which is more important for decoding at each decoding step. The proposed method also does not require any training. 2. Starting from the baseline model, several optimization techniques have been proposed to improve model accuracy and running efficiency, including adaptive plausibility constraint to only keep the most plausible tokens, and a fast version ED to only focus on one partic
The proposed ensemble encoding strategy, fuses the logits extracted from the original image and multiple local crops. This method makes sense, but introduces much more computation burden for real-world deployment. The FastED only processes one local crop and have some accuracy drops. However, one simpler alternative is to move the ensemble to the input, and concat the multiple images in the prompt. We can imagine that this method is much faster than ED/FastED because only one feed-forward is nee
- The idea of Ensemble Decoding (ED) is interesting, and well motived by evidences of model will get right answer after applying crop and resize to the image and the toy experiment of checking whether properly divided sub-images can reduce object hallucination in the outputs o LVLMs - The authors conducted extensive ablation studies and main experiments to validate the effectiveness of the method, and the experimental results appear to be quite convincing.
- There doesn’t seem to be a deeper explanation for why this method works. It would be better if the authors could provide more insights. How do the degree of cropping and resizing, as well as the attention weights, affect the results? I don’t seem to see any related discussion or analysis on this. - How do the degree of cropping and resizing, as well as the attention weights, affect the results? I don’t seem to see any related discussion or analysis on this.
1. It is interesting to explore the effect of the number of unnecessary objects in an image and the object resolution on the performance of LVLM. 2. The proposed method is simple and straightforward to implement, which enhances its reproducibility. 3. Experiments on multiple benchmarks demonstrate that the proposed method can achieve better performance than state-of-the-art work.
I have some concerns about this paper: 1. From Figure 3, it seems that feeding multiple sub-images into LVLM gives more correct answers than feeding the original image. I'm curious about the performance of ED without using the original image. 2. If LVLM produces the correct output for the original image, but produces the wrong output for the sub-image, does ED negatively affect the understanding of the model in this case? For example, can the split sub-image get a valid output when the target o
Videos
