Fine-grained Token Allocation Via Operation Pruning for Efficient MLLMs

Aoming Liu; Reuben Tan; Boqing Gong; Bryan A. Plummer

arXiv:2507.02909 · cs.LG · November 14, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a fine-grained operation pruning framework for Multimodal Large Language Models, enabling more efficient token processing by selectively pruning redundant modules while maintaining output quality.

Contribution

It proposes Depth-wise Operation Pruning (DOP), a novel data-driven method that optimizes token allocation across modules with minimal validation runs, achieving state-of-the-art efficiency.

Findings

01. Achieves 86% TFLOPS reduction on LLaVA-Next-7B

02. Reduces latency by 83% with only 1% performance loss

03. Outperforms 12 baselines across 13 benchmarks

Abstract

Token reduction accelerates Multimodal Large Language Models (MLLMs) by reducing excessive tokens, but it overlooks differences in structural redundancy: critical and redundant modules process identical token loads. For fine-grained computation control, we define an "operation" as the computation a module performs to process a group of tokens, and we introduce the operation pruning framework to enable modules to selectively process tokens. Built on this framework, we propose Depth-wise Operation Pruning (DOP), a data-driven method that searches for strategies to prune redundant operations, saving computational budget so that critical modules can process more tokens than under uniform allocation. The search minimizes divergence from the original model's output probability distribution on a small validation set while satisfying computational constraints. For efficient optimization, DOP applies depth-wise pruning…
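The selection criterion described in the abstract (smallest divergence from the original model's output distribution, subject to a compute budget) can be illustrated with a minimal sketch. It assumes the divergence is a KL divergence averaged over validation samples, which matches how Reviewer 01 describes the objective. The candidate names, costs, and probabilities below are toy values invented for illustration, and `select_strategy` is a hypothetical helper, not the authors' implementation.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def mean_kl(reference_probs, pruned_probs):
    """Average KL divergence over all validation samples."""
    total = sum(kl_divergence(p, q)
                for p, q in zip(reference_probs, pruned_probs))
    return total / len(reference_probs)

def select_strategy(candidates, reference_probs, budget):
    """Return the candidate within budget that has the smallest mean KL.

    Each candidate is a dict with hypothetical keys:
      'name'  - label for the pruning strategy
      'cost'  - compute cost in arbitrary units (e.g. TFLOPs)
      'probs' - output distributions of the pruned model per sample
    Returns None when no candidate fits the budget.
    """
    feasible = [c for c in candidates if c["cost"] <= budget]
    if not feasible:
        return None
    return min(feasible, key=lambda c: mean_kl(reference_probs, c["probs"]))

# Toy validation set: output distributions of the unpruned model.
reference = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]

# Toy candidates (all numbers are made up for illustration).
candidates = [
    {"name": "uniform", "cost": 1.0,
     "probs": [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]]},
    {"name": "reallocated", "cost": 1.0,
     "probs": [[0.68, 0.22, 0.10], [0.12, 0.78, 0.10]]},
    {"name": "unpruned", "cost": 7.0, "probs": reference},
]

best = select_strategy(candidates, reference, budget=1.0)
print(best["name"])
```

In this toy setup the unpruned model has zero divergence but exceeds the budget, so the choice falls to whichever budget-feasible candidate stays closest to the reference distribution.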

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

* Addresses an important and timely efficiency problem in multimodal LLMs.
* The method is simple, easy to integrate, and empirically effective across different architectures.
* Extensive experimental coverage with clear ablations and latency analysis.

Weaknesses

* **Incremental novelty:** The idea of pruning at finer granularity is not fundamentally new; the contribution is mostly heuristic refinements (depth ordering + additive scoring).
* **Limited theoretical insight:** The additive approximation is empirical, with no formal justification or guarantees; the correlation analysis is weak evidence for correctness.
* **Heavy reliance on validation heuristics:** The optimization objective (KL divergence on a small validation set) may not correlate well

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Through depth-wise pruning constraints and additive approximation, the complex optimization problem becomes highly efficient to solve.
2. The experimental validation is strong. The method is tested on diverse MLLMs (LLaVA, Qwen, InternVL) and 13 benchmarks, with comparisons against recent baselines. Using fixed TFLOPs budgets ensures fair evaluation.
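Why an additive approximation makes the search cheap can be shown with a small sketch. It assumes three strategy variables, named `d_a`, `d_p`, and `n_v` here after the symbols used in this review, and assumes the divergence of a joint strategy is approximated by the sum of per-variable divergences measured one variable at a time. The meaning given to each variable, the cost formula, and all numbers are hypothetical toy choices, not taken from the paper.

```python
import math
from itertools import product

def additive_search(options, marginal_score, cost, budget):
    """Pick the budget-feasible strategy with the lowest additive score.

    options:        dict mapping variable name -> list of candidate values
    marginal_score: dict mapping variable name -> {value: divergence measured
                    when only that variable is set to the value}
    cost:           function mapping a strategy dict to a compute cost
    Returns (best_strategy, approx_score), or (None, inf) if nothing fits.
    """
    names = list(options)
    best, best_score = None, float("inf")
    for values in product(*(options[n] for n in names)):
        strategy = dict(zip(names, values))
        if cost(strategy) > budget:
            continue  # violates the compute constraint
        # Additive approximation: sum the individually measured divergences.
        score = sum(marginal_score[n][v] for n, v in strategy.items())
        if score < best_score:
            best, best_score = strategy, score
    return best, best_score

# Hypothetical search space: two pruning depths and a visual-token count.
options = {"d_a": [8, 16, 24], "d_p": [8, 16, 24], "n_v": [64, 128, 256]}

# Toy per-variable divergences (one validation run per entry).
marginal_score = {
    "d_a": {8: 0.30, 16: 0.10, 24: 0.02},
    "d_p": {8: 0.20, 16: 0.08, 24: 0.01},
    "n_v": {64: 0.40, 128: 0.15, 256: 0.03},
}

def toy_cost(s):
    """Made-up cost model: more tokens and deeper processing cost more."""
    return s["n_v"] * (s["d_a"] + s["d_p"]) / 1000

best, score = additive_search(options, marginal_score, toy_cost, budget=4.5)

# Validation runs needed: sum of option counts vs. their product.
runs_additive = sum(len(v) for v in options.values())
runs_joint = math.prod(len(v) for v in options.values())
print(best, runs_additive, runs_joint)
```

Enumerating combinations is cheap because it only adds stored numbers; the expensive step is the validation runs, which grow with the sum of the option counts instead of their product. The sketch also makes the independence assumption questioned in the weaknesses explicit: interactions between the variables are ignored by construction.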

Weaknesses

1. The method relies on two strong simplifying assumptions (monotonic depth redundancy and additive independence across $\mathbf{D_A}$, $\mathbf{D_P}$, and $\mathbf{n_v}$), which may not always hold, as deeper layers are not necessarily more redundant for all tasks. This can limit the method's ability to reach the global optimum and lead to suboptimal strategies in certain scenarios.
2. The performance of DOP is highly sensitive to the validation set used for optimization, and strategies optimized

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The paper tackles the critical problem of MLLM inference efficiency with a conceptually clear and technically sound approach. The shift from "token-level" to "operation-level" pruning is a well-motivated extension of the existing paradigm, offering finer-grained control with clear theoretical and practical value. The optimization strategy—combining depth-wise constraints with additive approximation—is cleverly designed to balance performance and efficiency, making DOP feasible for real-world dep

Weaknesses

1. The paper fails to adequately discuss or compare against recent works that also focus on "operation pruning," notably GSOP [1], Short-LVLM [2], and Skip-Vision [3]. These works similarly aim to accelerate models by pruning internal operations rather than just tokens, yet they are omitted from both the Related Work and experiments. This omission weakens the paper's novelty claim. The authors should clearly articulate the core methodological differences between DOP and GSOP (e.g., DOP's depth-w

Taxonomy

Topics: Semantic Web and Ontologies · Geographic Information Systems Studies · Constraint Satisfaction and Optimization