Vision Remember: Recovering Visual Information in Efficient LVLM with Vision Feature Resampling
Ze Feng, Jiang-jiang Liu, Sen Yang, Lingyu Xiao, Zhibin Quan, Zhenhua Feng, Wankou Yang, Jingdong Wang

TL;DR
This paper introduces Vision Remember, a method that resamples visual features within LVLMs to recover detailed visual information lost during compression, improving performance on visual understanding tasks.
Contribution
The paper proposes a novel resampling approach with specific attention modules to enhance visual information retention in LVLMs, outperforming existing methods across multiple benchmarks.
Findings
Outperforms TokenPacker, FastV, DeepStack, and SVA Aggregator on benchmarks.
Enhances visual understanding in tasks like OCR and chart analysis.
Demonstrates strong generalization across various LVLM configurations.
Abstract
The computational expense of redundant vision tokens in Large Vision-Language Models (LVLMs) has led many existing methods to compress them via a vision projector. However, this compression may lose visual information that is crucial for tasks relying on fine-grained spatial relationships, such as OCR and Chart&Table Understanding. In this paper, we propose to resample original vision features across the LLM decoder layers to recover visual information and attain efficiency. Following this principle, we introduce Vision Remember, which includes two key modules: (1) Token-Feature Cross-Attention Layer and (2) Token Bidirectional Self-Attention Layer. In the Token Bidirectional Self-Attention Layer, we employ a self-attention mechanism to maintain the bidirectional interaction between vision tokens and the text-guided token. In the Token-Feature Cross-Attention Layer, we introduce local…
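The two modules named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head attention without learned projections, an arbitrary ordering of the two layers, and plain global cross-attention (the abstract's truncated "local…" detail is not reproduced). The names `attention`, `vision_resample`, `vision_tokens`, `text_token`, and `vision_features` are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention with no mask,
    # so every query can attend to every key (bidirectional).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def vision_resample(vision_tokens, text_token, vision_features):
    # Assumed step 1: bidirectional self-attention over the compressed
    # vision tokens together with the text-guided token.
    tokens = np.concatenate([vision_tokens, text_token], axis=0)
    tokens = tokens + attention(tokens, tokens, tokens)
    # Assumed step 2: cross-attention in which the tokens query the
    # original (uncompressed) vision features to recover detail.
    tokens = tokens + attention(tokens, vision_features, vision_features)
    # Return only the refreshed vision tokens.
    return tokens[: len(vision_tokens)]

rng = np.random.default_rng(0)
d = 16
vision_tokens = rng.normal(size=(4, d))     # compressed vision tokens
text_token = rng.normal(size=(1, d))        # text-guided token
vision_features = rng.normal(size=(36, d))  # original vision features
out = vision_resample(vision_tokens, text_token, vision_features)
print(out.shape)  # (4, 16)
```

The sketch shows why the token count seen by the decoder stays small: the 36 original features are used only as keys and values, while the output keeps the 4 compressed tokens.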
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The overall writing is relatively clear.
2. Experiments show consistent improvements with better efficiency, demonstrating the effectiveness of the module.
1. The phenomenon shown in Fig. 1b is not particularly new; it has been reported repeatedly (e.g., FastV, PyramidDrop).
2. The performance of “LLaVA-NeXT” is too low, so validating an algorithm on it is not very convincing; stronger baselines could be built with more data and a newer LLM.
3. Are the FastV and PDrop configurations presented in Table 3 their default settings? Please also show the results with no compression applied.
The writing is clear, and the overall presentation is well-structured.
1. The proposed Vision Remember framework appears to be an incremental improvement over existing LVLM architectures, as it primarily involves a relatively straightforward modification of the self-attention and cross-attention mechanisms. The methodological novelty seems limited without deeper architectural innovation.
2. It is unclear whether Table 1 compares Vision Remember and the baseline under identical training conditions. If both are trained on the same dataset, the practical significance
1. Identifies two concrete failure modes in efficient LVLMs: (i) an information bottleneck from projector compression and (ii) visual cue forgetting across decoder layers. This is a useful diagnosis.
2. Resampling vision features mid-decoder is architecturally simple, does not require retraining the whole LVLM from scratch, and can be attached to different visual projectors and different backbones.
3. Ablations are thorough.
1. Training cost is not fully discussed. The authors retrain with CC-558K + 779K instruction tuning. It is unclear whether Vision Remember needs full two-phase tuning each time it is attached to a new backbone or projector, or whether it can be added with light finetuning on a smaller set. This matters for “plug-and-play” claims.
2. Some methods are re-trained on additional data instead of with their originally released recipes. This might not be fair to training-free methods.
3. In the empirical study, the proposed
- The research direction discussed in the paper (efficiency consequences of augmenting a large decoder model with visual tokens) is valuable and important for practical use cases.
- Paper presentation is generally good and easy to follow.
- The proposed method is intuitive: it allows visual tokens to access original visual features to tackle the information bottleneck caused by visual token compression.
- The preliminary linear probing results shown in Figure 1 are interesting.
- The empirical r
- One major shortcoming of all presented results is that accuracy and cost are not discussed together. The proposed method reduces the number of visual tokens but adds extra compute and parameters. Therefore, to fairly compare it with other methods, results should show that for the same cost (for example, TTFT), it achieves better accuracy. For instance, in Table 5a it is shown that adding the proposed VR module on top of Qwen2.5-VL improves performance by 1.7. What happens to TTFT and TPS in th
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
