TL;DR
LaCo is a novel layer-wise visual token compression framework for multimodal large language models that improves efficiency and maintains performance by compressing tokens within the vision encoder layers.
Contribution
LaCo introduces a new layer-wise compression method with pixel-shuffle and residual learning, enabling more effective and efficient token compression during model training and inference.
Findings
Outperforms existing token compression methods when tokens are compressed in the intermediate layers of the vision encoder.
Improves training efficiency by over 20%.
Enhances inference throughput by over 15%.
Abstract
Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains. To address this limitation, we propose LaCo (Layer-wise Visual Token Compression), a novel framework that enables effective token compression within the intermediate layers of the vision encoder. LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts that preserves critical visual information during compression. Extensive experiments indicate that our LaCo outperforms all existing methods when compressing tokens in the intermediate layers of the vision encoder, demonstrating superior effectiveness. In addition, compared to external…
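The two components named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the 2x2 merge ratio, the row-major token grid, and the linear projection `weight` are all assumptions made for illustration, and the channel-averaging shortcut follows one reviewer's description of the residual pathway ("space-to-channel + channel averaging").

```python
import numpy as np

def space_to_channel(tokens, grid_h, grid_w, r=2):
    """Merge each r x r block of adjacent tokens into one token.

    tokens: (grid_h * grid_w, C) array laid out row-major over the patch grid.
    Returns ((grid_h // r) * (grid_w // r), r * r * C).
    """
    n, c = tokens.shape
    assert n == grid_h * grid_w, "token count must match the grid"
    assert grid_h % r == 0 and grid_w % r == 0, "grid must be divisible by r"
    x = tokens.reshape(grid_h // r, r, grid_w // r, r, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/r, W/r, r, r, C)
    return x.reshape((grid_h // r) * (grid_w // r), r * r * c)

def nonparametric_shortcut(merged, r=2):
    """Average the r*r channel groups back down to C channels (no weights)."""
    m, rc = merged.shape
    c = rc // (r * r)
    return merged.reshape(m, r * r, c).mean(axis=1)

def compress(tokens, grid_h, grid_w, weight, r=2):
    """Hypothetical compression step: learned projection plus shortcut."""
    merged = space_to_channel(tokens, grid_h, grid_w, r)
    projected = merged @ weight  # assumed learned map, shape (r*r*C, C)
    return projected + nonparametric_shortcut(merged, r)

# Example: a 4x4 grid of 3-channel tokens becomes a 2x2 grid (16 -> 4 tokens).
tokens = np.arange(16 * 3, dtype=np.float64).reshape(16, 3)
weight = np.zeros((2 * 2 * 3, 3))  # zero projection isolates the shortcut
out = compress(tokens, 4, 4, weight)
print(out.shape)
```

With a zero projection the output reduces to the shortcut alone, which is average pooling over each 2x2 block of tokens. This shows how a weight-free path can carry the original features through the compression step.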
Peer Reviews
Decision·Submitted to ICLR 2026
1. Clear presentation: The paper presents its method and motivation clearly.
2. Substantial experimental results: When LaCo is compared with other compression methods (Pixel-Shuffle, LDPv2, TokenPacker), all placed at an intermediate layer (the 1/4 layer), LaCo (53.6 Avg) dramatically outperforms the others (all ~36 Avg).
3. Good ablation: The paper provides a clear and useful ablation study on the effect of the compression layer's depth (e.g., 1/12, 1/6, 1/4, 1/2). This analysis correctly ide
1. Core claim contradiction: The paper's central argument is that internal compression is superior. However, the experimental data for the 0.5B models (Tables 4, 5) do not support this for performance: when internal (LaCo@1/4) and external (LaCo@1) compression are compared, external compression outperforms internal compression in many cases.
2. Insufficient experiments: The paper only conducts experiments on 0.5B models and provides no evidence that its claims extend to larger models.
3. Misleading baseli
[+] The manuscript is well written. [+] Experiments are conducted in a series of MLLM benchmarks, following the standard pipeline of LLaVA-OneVision.
[-] Novelty. Many existing works have explored token compression in the visual encoder of MLLMs, but these have been largely overlooked in the related work. Overall, the idea presented in this paper is trivial and contributes little to the community. Token Merging: Your ViT But Faster. SPViT: Enabling Faster Vision Transformers via Soft Token Pruning. Not All Patches Are What You Need: Expediting Vision Transformers via Token Reorganizations. FOLDER: Accelerating Multi-modal Large Language Mod
- The proposed method addresses an increasingly relevant challenge: mitigating the visual token bottleneck in MLLMs, which remains a significant computational and practical barrier in scaling such systems.
- The design of inserting the Patch Merge Layer (PML) at intermediate encoder depths and supplementing with a residual (space-to-channel + channel averaging) pathway is thoughtfully motivated and directly targets information loss from aggressive lossy compression.
- Thorough experimental evalu
1. **Suboptimal Positioning vs. Most Recent Work**: - The Related Work section is incomplete with respect to several directly relevant recent advancements in token pruning and flexible token selection for MLLMs (e.g., LLaVA-Scissor, GlimpsePrune, TransPrune, FlexSelect, GreedyPrune, Beyond Attention/Similarity, explainability-driven compression, dynamic pruning in DynTok, Video Compression Commander, and SmolVLM). None of these are properly cited or contrasted, though they share substantial
