HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit

Hao Wu; Yingqi Fan; Jinyang Dai; Junlong Tong; Yunpu Ma; Xiaoyu Shen

arXiv:2602.23699 · cs.CV · March 2, 2026

TL;DR

HiDrop is a hierarchical vision token reduction framework for multimodal large language models. It lowers computational cost while maintaining performance by injecting visual tokens late, pruning them along a concave pyramid schedule, and applying an Early Exit mechanism in the deep layers.

Contribution

The paper presents a hierarchical token pruning method that combines Late Injection with Concave Pyramid Pruning and Early Exit, aiming to improve efficiency in MLLMs without sacrificing accuracy.

Findings

1. Reduces visual tokens by about 90% without performance loss.

2. Speeds up training by 1.72×.

3. Sets a new state of the art in efficient MLLM training and inference.

Abstract

The quadratic computational cost of processing vision tokens in Multimodal Large Language Models (MLLMs) hinders their widespread adoption. While progressive vision token pruning offers a promising solution, current methods misinterpret shallow layer functions and use rigid schedules, which fail to unlock the full efficiency potential. To address these issues, we propose HiDrop, a framework that aligns token pruning with the true hierarchical function of MLLM layers. HiDrop features two key innovations: (1) Late Injection, which bypasses passive shallow layers to introduce visual tokens exactly where active fusion begins; and (2) Concave Pyramid Pruning with an Early Exit mechanism to dynamically adjust pruning rates across middle and deep layers. This process is optimized via an inter-layer similarity measure and a differentiable top-k operator. To ensure practical efficiency, HiDrop…
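The layer-wise schedule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 32-layer depth, the injection layer (8), the exit layer (24), and the power-law decay are all assumptions chosen for illustration, and only the 576 and 48 token counts come from figures quoted in the reviews. The sketch covers only how many vision tokens each layer carries; the inter-layer similarity measure and the differentiable top-k operator that select which tokens survive are not modeled.

```python
def vision_token_schedule(num_layers, full_tokens, inject_layer,
                          exit_layer, final_tokens, power=0.5):
    """Return a hypothetical per-layer vision token count.

    Layers before `inject_layer` carry no vision tokens (late injection),
    layers from `exit_layer` onward carry none either (early exit), and the
    layers in between shrink from `full_tokens` to `final_tokens`.
    With power < 1 the count drops steeply at first and then levels off,
    which is the shape assumed here for a "concave pyramid".
    """
    if not 0 <= inject_layer < exit_layer <= num_layers:
        raise ValueError("need 0 <= inject_layer < exit_layer <= num_layers")
    if not 0 <= final_tokens <= full_tokens:
        raise ValueError("need 0 <= final_tokens <= full_tokens")

    span = exit_layer - inject_layer
    schedule = []
    for layer in range(num_layers):
        if layer < inject_layer or layer >= exit_layer:
            schedule.append(0)
            continue
        # t runs from 0 at the injection layer to 1 at the last active layer.
        t = (layer - inject_layer) / (span - 1) if span > 1 else 0.0
        keep = final_tokens + (full_tokens - final_tokens) * (1 - t ** power)
        schedule.append(round(keep))
    return schedule


def mean_token_fraction(schedule, full_tokens):
    """Average share of vision tokens per layer, relative to no pruning."""
    return sum(schedule) / (len(schedule) * full_tokens)


if __name__ == "__main__":
    # All layer indices below are illustrative assumptions.
    sched = vision_token_schedule(num_layers=32, full_tokens=576,
                                  inject_layer=8, exit_layer=24,
                                  final_tokens=48)
    print(sched)
    print(f"mean token fraction: {mean_token_fraction(sched, 576):.3f}")
```

The mean token fraction is only a rough proxy for savings: because attention cost grows faster than linearly with sequence length, the actual compute reduction depends on the text length and on how the model is implemented.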

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The proposed method is effective at reducing the number of visual tokens, compared with state-of-the-art methods, on extensive benchmarks.

Weaknesses

1. Generalizability of the analysis in Section 2. The analysis in Section 2 serves as the primary motivation for the proposed method. However, the details about the training and evaluation datasets, training configuration, and the exact model type (e.g., the LLM backbones) are missing. It is challenging to convince the reviewer that the empirical observation in Section 2 is generally applicable to various pretrained datasets, evaluation tasks, training configurations, and model types for MLLM. T

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The complete method is built on close observation of the vision-language fusion process in MLLMs, which not only produces reasonable model structures but also provides important insights for future vision-language model research.

2. The proposed method achieves strong performance in the low-token regime, retaining 96.5% of the performance with 48 tokens compared with the original 576 tokens.

Weaknesses

1. While the identified internal mechanisms are potentially insightful, their presentation and explanation are of limited quality. Key concepts lack sufficient clarification, leaving readers unable to follow the reasoning that leads to the conclusion. This is especially the case for Figures 2 and 3 (questions detailed in the Questions section below).

2. The analysis seems to be limited to a single type of language model, which lowers the credibility of the ge

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The finding that visual tokens in most shallow layers are also dispensable is a novel insight that holds significant heuristic value for the VLM research community.

2. The methodology itself is sound; skipping non-essential layers combined with a learnable token selection in the intermediate layers allows for maximal visual token compression, a claim that is substantiated by the experimental results.

Weaknesses

1. The paper lacks a detailed justification for *why* visual tokens in shallow layers are non-essential. This finding serves as a critical premise for the proposed method, yet the supporting argumentation provided is neither detailed nor sufficient.

2. The experimental validation relies on relatively older, and arguably undertrained, VLMs (e.g., LLaVA-1.5). It remains unclear whether the conclusions generalize to more recent, powerful models (e.g., Qwen2.5-VL, Gemma3-VL, Qwen3-VL) and ac

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis