TL;DR
This paper introduces PPE, a novel positional preservation embedding that maintains spatiotemporal structure during token compression in multimodal large language models, leading to improved performance across various vision-language tasks.
Contribution
PPE is a parameter-free, generic encoding operator that explicitly preserves positional information during token merging, enhancing the effectiveness of token compression in MLLMs.
Findings
Achieves 2-5% performance improvements on multiple benchmarks.
Supports cascade clustering for progressive token compression.
Effectively preserves spatial and temporal cues during token reduction.
Abstract
Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks, yet often suffer from inefficiencies due to redundant visual tokens. Existing token merging methods reduce sequence length but frequently disrupt spatial layouts and temporal continuity by disregarding positional relationships. In this work, we propose a novel encoding operator dubbed as Positional Preservation Embedding (PPE), which has the main hallmark of preservation of spatiotemporal structure during visual token compression. PPE explicitly introduces the disentangled encoding of 3D positions in the token dimension, enabling each compressed token to encapsulate different positions from multiple original tokens. Furthermore, we show that PPE can effectively support cascade clustering -- a progressive token compression strategy that leads to better…
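The idea in the abstract of letting one compressed token carry several original positions can be illustrated with a small sketch. This is not the authors' implementation: the helper names `ppe_position_ids` and `rotary_angles`, the even-spaced choice of representative positions, the contiguous channel slices, and the single shared frequency schedule are all assumptions made for illustration. The only property it relies on is that rotary embedding channels are rotated independently, so different channel slices can use different position ids.

```python
import numpy as np


def ppe_position_ids(positions, num_channels, k):
    """Hypothetical helper: per-channel (t, h, w) position ids for ONE merged token.

    positions    : (n, 3) integer array, the (t, h, w) positions of the
                   original tokens that were merged into this token.
    num_channels : number of rotary frequency channels per axis (assumed).
    k            : how many original positions to keep (assumed to divide
                   num_channels evenly).
    Returns an array of shape (3, num_channels).
    """
    positions = np.asarray(positions)
    if num_channels % k != 0:
        raise ValueError("k must divide num_channels in this sketch")
    n = positions.shape[0]
    # Assumption: pick k representatives evenly spaced through the cluster
    # (indices repeat when the cluster has fewer than k tokens).
    idx = np.linspace(0, n - 1, num=k).round().astype(int)
    chosen = positions[idx]  # shape (k, 3)
    # Each chosen position owns a contiguous slice of num_channels // k channels.
    return np.repeat(chosen.T, num_channels // k, axis=1)


def rotary_angles(pos_ids, base=10000.0):
    """Rotation angle per channel: position id times a geometric frequency."""
    num_channels = pos_ids.shape[-1]
    inv_freq = base ** (-np.arange(num_channels) / num_channels)
    return pos_ids * inv_freq


# A 2x2 spatial patch from frame 0, merged into a single token.
cluster = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1]]

single = ppe_position_ids(cluster, num_channels=4, k=1)  # one position for all channels
multi = ppe_position_ids(cluster, num_channels=4, k=2)   # two positions, two channel slices

print(single)
print(multi)
print(rotary_angles(multi))
```

With `k=1` every channel is rotated by the same position, which corresponds to a merged token that remembers only one location. With `k=2` the first and second halves of the channels are rotated by two different original positions, so the merged token keeps a coarse record of the region it covers.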
Peer Reviews
Decision·ICLR 2026 Poster
1. Motivation is clear: positional degradation under token compression. 2. Good solution: leveraging RoPE dimension independence. 3. Strong efficiency gains with competitive accuracy at high compression ratios.
1. Evaluated mainly with clustering (DPC-KNN); no results with learning-based compression. 2. Unclear applicability to ALiBi or other positional schemes. 3. More evaluation is needed: layout-heavy and OCR-centric tasks are only partially explored in the paper.
1. The paper clearly identifies that existing token compression methods lose fine-grained positional information, with compelling evidence (Figure 1, attention visualizations) showing how this impacts layout-sensitive tasks. 2. The authors provide thorough ablation studies covering K values, reduction ratios, cascade compression strategies, and include attention visualizations and failure case analyses. 3. PPE supports cascade compression, enabling 90% token reduction while maintaining perform
1. Critically incomplete baseline comparisons: The paper lacks comparisons with several mainstream token compression methods like FastV, VisionZip, MustDrop, and TokenCarve[1-4]. 2. Unvalidated position correspondence assumption: The core assumption that compressed tokens retain a meaningful correspondence to original image positions after vision encoder processing (ViT layers, pooling) is not validated and is questionable for many architectures. This is particularly relevant for models like In
1. **Novel Concept**: PPE preserves the spatio-temporal integrity of visual tokens by encoding multiple position identifiers within a single compressed token. Even under high compression ratios, it significantly reduces computational and memory overhead while maintaining performance. 2. **Parameter-free Compatibility**: The PPE strategy requires no additional training parameters or architectural modifications and can be seamlessly integrated into existing large-scale language models and token
1. **Benchmark Coverage for Layout-Sensitive Tasks**: The current evaluation primarily targets general QA-type tasks (e.g., MMBench, VideoMME). While TextVQA is included, the paper lacks comprehensive validation on benchmarks specifically designed for fine-grained spatial and layout understanding, such as OCR-intensive tasks. It is recommended to incorporate dedicated OCR benchmarks (e.g., OCRBench and DocVQA) to rigorously validate PPE's capability in preserving precise spatial relationships fo
