iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
Lianyu Hu, Liqing Gao, Fanhua Shang, Liang Wan, Wei Feng

TL;DR
iLLaVA introduces a joint optimization and token merging strategy for large multimodal models, significantly reducing input tokens and computational costs while improving performance on vision-language tasks.
Contribution
The paper proposes a novel token merging strategy and joint optimization of the image encoder and LLM, enabling end-to-end acceleration and better efficiency in multimodal models.
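To make the idea of token merging concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm: the summary above does not specify how tokens are ranked or matched, so the per-token importance scores, the cosine-similarity matching, and the helper names `cosine` and `merge_tokens` are all assumptions made for illustration. The sketch keeps the highest-scoring tokens and averages each remaining token into its most similar kept token rather than discarding it.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_tokens(tokens, scores, keep):
    """Hypothetical merge step: keep the `keep` highest-scoring tokens and
    average every other token into its most similar kept token.

    tokens: list of feature vectors (lists of floats)
    scores: one importance score per token (assumed given, e.g. from attention)
    keep:   number of tokens to retain
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])  # retain the original token order
    dropped_idx = order[keep:]

    sums = {i: list(tokens[i]) for i in kept_idx}
    counts = {i: 1 for i in kept_idx}

    for d in dropped_idx:
        # Assign the dropped token to the kept token it resembles most.
        target = max(kept_idx, key=lambda k: cosine(tokens[d], tokens[k]))
        sums[target] = [s + x for s, x in zip(sums[target], tokens[d])]
        counts[target] += 1

    # Each output token is the mean of a kept token and everything merged into it.
    return [[s / counts[i] for s in sums[i]] for i in kept_idx]


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scores = [0.9, 0.1, 0.8, 0.2]
merged = merge_tokens(tokens, scores, keep=2)
print(len(tokens), "->", len(merged), "tokens")
```

Averaging is only one possible way to fold dropped tokens into kept ones; a weighted merge or a different matching rule would fit the same interface.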
Findings
Up to 2x throughput boost in image/video understanding tasks
Achieves 4x reduction in prefilling time
Larger models outperform smaller ones in accuracy and efficiency
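As a rough intuition for why fewer visual tokens shorten prefilling, the sketch below uses a common back-of-the-envelope FLOP estimate for one transformer layer: about 24·n·d² for the linear projections and MLP plus 4·n²·d for attention, given n tokens and hidden size d. The function names and all numbers in the example (576 versus 192 visual tokens, 64 text tokens, hidden size 4096, 32 layers) are illustrative assumptions, not values taken from the paper, and the estimate ignores the image encoder, memory traffic, and kernel efficiency.

```python
def layer_flops(n_tokens, d_model):
    """Approximate forward FLOPs of one transformer layer (4x MLP expansion assumed)."""
    linear = 24 * n_tokens * d_model ** 2      # QKV/output projections + MLP
    attention = 4 * n_tokens ** 2 * d_model    # QK^T scores + weighted sum over V
    return linear + attention


def prefill_flops(n_visual, n_text, d_model, n_layers):
    """Approximate LLM prefill FLOPs for a prompt of visual plus text tokens."""
    return n_layers * layer_flops(n_visual + n_text, d_model)


# Hypothetical configuration, chosen only for illustration.
baseline = prefill_flops(n_visual=576, n_text=64, d_model=4096, n_layers=32)
reduced = prefill_flops(n_visual=192, n_text=64, d_model=4096, n_layers=32)
print(f"estimated prefill FLOP ratio: {baseline / reduced:.2f}x")
```

With these made-up numbers the estimated LLM-side ratio is about 2.5x. It is not meant to reproduce the reported measurements, which also reflect savings inside the image encoder and other runtime effects that this model leaves out.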
Abstract
Recent methods have made notable progress in accelerating Large Vision-Language Models (LVLMs) by exploiting the inherent redundancy in visual inputs. Most existing approaches, however, focus narrowly on reducing image tokens before or within the Large Language Model (LLM) stage to lower computational cost. This overlooks other major bottlenecks, particularly the image encoder, which itself requires substantial computation. As a result, these methods fall short of achieving true end-to-end acceleration. Importantly, the image encoder is the primary contributor of input tokens to the LLM. Thus, reducing visual redundancy at the encoder stage not only speeds up the encoder itself but also significantly lightens the workload for the subsequent LLM. Motivated by this, we investigate how to jointly optimize the image encoder and the LLM along with other LVLM components for comprehensive…
Peer Reviews
Decision: ICLR 2026 Poster
- The method extends token reduction from the LLM to the image encoder, achieving dual acceleration and significantly reducing overall computational and memory overhead.
- It aggregates discarded information into "recycled tokens", maintaining over 95% accuracy even at extremely high compression rates and balancing speed and performance.
- Can this method be adapted to FlashAttention? Can it also be adapted to the vLLM and SGLang inference frameworks? Adapting it to mainstream frameworks seems difficult.
- Performance drops significantly on tasks that require fine spatial information, such as DocVQA and ChartQA. Small targets and dense text are prone to losing key details because of token merging.
- The token reduction ratio and the insertion layer positions need to be manually optimized, lacking an adaptiv
Unlike many previous works in the area, which focus on single-image tasks only, this paper presents comprehensive experiments on multi-image and video benchmarks and shows stronger performance than several baselines. The experiments with four different VLMs demonstrate the effectiveness of iLLaVA. The results on memory usage, prefilling time, and throughput provide interesting insights into the impact of visual token pruning from different angles, which previous works often neglect.
My main concern is the novelty of this work. The paper claims the two-stage pruning method and token merging as novel contributions. However, this idea has already been explored in previous work. For example, VScan (Zhang et al. 2025) adopts a very similar idea, pruning tokens at both the visual encoder stage and the LLM decoder stage. Furthermore, VScan also proposes merging pruned tokens instead of discarding them. The only difference might be the specific layer index where the token pruning/merging happens. Z
- Simple training-free approach. The paper adopts a training-free token merging method for ViT and LLM. Compared with merely merging or pruning tokens on MLLMs, the proposed method achieves better performance, lower memory overhead, and higher throughput.
- Extensive experiments. The paper conducts extensive experiments on two benchmark suites to verify the effectiveness of the proposed method across 9 image tasks and 8 video tasks. Under fair comparison settings, the proposed method achieves be
- Lack of novelty. The paper adopts the existing token merging method and extends its application from LLM to ViT.
- Simplified visualization in Figure 2. Figure 1 presents the attention scores in ViT for a single object against a simple, plain background. This straightforward case fails to convince readers. What would the attention map look like when the image scene is complex?
- Limited experimental benchmarks. The paper only conducts experiments on general QA benchmarks. How does it perform o
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications
Methods: Pruning
