Enhancing Multi-Image Understanding through Delimiter Token Scaling
Minyoung Lee, Yeji Park, Dongjun Hwang, Yejin Kim, Seong Joon Oh, Junsuk Choe

TL;DR
This paper introduces a simple yet effective method of scaling delimiter token hidden states in large vision-language models to improve multi-image understanding without extra training or inference costs.
Contribution
The paper proposes a novel delimiter token scaling technique that enhances intra-image information preservation and reduces cross-image leakage in LVLMs.
Findings
Performance improved on multi-image benchmarks like Mantis and MIRB.
Enhanced multi-document and multi-table understanding in text-only tasks.
Method requires no additional training or inference overhead.
Abstract
Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but their performance declines when multiple images are provided as input. One major reason is the cross-image information leakage, where the model struggles to distinguish information across different images. Existing LVLMs already employ delimiter tokens to mark the start and end of each image, yet our analysis reveals that these tokens fail to effectively block cross-image information leakage. To enhance their effectiveness, we propose a method that scales the hidden states of delimiter tokens. This enhances the model's ability to preserve image-specific information by reinforcing intra-image interaction and limiting undesired cross-image interactions. Consequently, the model is better able to distinguish between images and reason over them more accurately. Experiments show performance gains on…
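The core operation described in the abstract, scaling the hidden states of delimiter tokens, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `scale_delimiter_states`, the scale factor `alpha`, the toy values, and the choice to represent hidden states as plain Python lists are all assumptions made for illustration. The summary does not state which layers are scaled or what factor the paper uses.

```python
from typing import Iterable, List, Sequence


def scale_delimiter_states(
    hidden: Sequence[Sequence[float]],
    delimiter_positions: Iterable[int],
    alpha: float,
) -> List[List[float]]:
    """Return a copy of `hidden` with delimiter-token vectors multiplied by `alpha`.

    hidden: one vector per token (hypothetical toy representation).
    delimiter_positions: indices of image start/end delimiter tokens.
    alpha: hypothetical scale factor; the paper's actual value is not given here.
    """
    delim = set(delimiter_positions)
    return [
        [alpha * x for x in vec] if i in delim else list(vec)
        for i, vec in enumerate(hidden)
    ]


# Toy sequence: <img_start>, patch, <img_end>, patch
hidden = [[1.0, 2.0], [0.5, 0.5], [3.0, -1.0], [0.5, 0.5]]
scaled = scale_delimiter_states(hidden, delimiter_positions=[0, 2], alpha=2.0)
```

Only the delimiter rows change and no parameters are learned, which is consistent with the claim that the method needs no additional training.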
Peer Reviews
Decision: ICLR 2026 Poster
Interleaved image–text processing still lacks clarity in terms of how information flows across modalities. This mechanism remains underexplored in MLLMs, though this work makes a valuable attempt to highlight the importance of key regions. The motivation is clear. The core idea is **very simple yet interesting**, and the method section is clearly presented. The method does not add training cost; it only reweights the hidden states.
1. The main concern is the **limited scope of evaluation**. The paper focuses primarily on math and multi-view benchmarks, whereas multi-image input represents a special case of **interleaved data** that can be applied to a broader range of scenarios. The performance under few-shot settings, where multiple instances are concatenated together, remains unclear and differs from the explored benchmarks. 2. The performance improvements are sometimes marginal, suggesting limited generalization. 3. T
1. The problem of cross-image information leakage in multi-image LVLM settings is important and worth investigating. 2. The prior analysis on delimiter tokens and the characterization of their key properties is clear and insightful, helping readers better understand the mechanism. 3. The experiments are extensive, covering multiple LVLM families and sizes, four multi-image benchmarks, two multi-document benchmarks, and one multi-table benchmark.
1. The concept of “sink tokens” has been studied in prior works, and there are also existing methods addressing cross-image leakage. Thus, the novelty and significance of the current findings appear limited. 2. The technical contribution of the proposed method is relatively weak. Scaling the hidden states of delimiter tokens provides only marginal performance gains. For instance, when applied to larger LVLMs such as InternVL3-14B or Qwen2.5-VL-32B, the improvements are minimal (e.g., 42.42 → 42.
- Solves cross-image leakage in LVLMs via delimiter scaling, with gains in multi-image/text tasks, no extra cost.
- Analyzes delimiter tokens’ key properties, offering clear theoretical basis for the method.
- Generalizes to text multi-instance tasks, works across models (0.5B–32B), and fits optimized kernels.
- Although the paper claims minimal impact on text-image interaction, it reports only a 10% drop in text-to-image attention scores without detailing how this drop affects downstream cross-modal tasks (e.g., image-text retrieval), leaving real-world cross-modal performance uncertain.
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
