TL;DR
Mema is a memory-augmented adapter that enhances vision-language understanding by hierarchically integrating visual cues across layers, improving multimodal reasoning with minimal additional training.
Contribution
Introduces a lightweight, plug-and-play memory module that captures hierarchical visual features and improves multimodal model performance without altering the backbone.
Findings
Mema improves performance across multiple benchmarks.
The memory mechanism effectively preserves fine-grained visual cues.
The approach adds only a small number of trainable parameters.
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable performance by aligning pretrained visual representations with the linguistic knowledge embedded in Large Language Models (LLMs). However, existing approaches typically rely on final-layer visual features or learnable multi-layer fusion, which often fail to sufficiently exploit hierarchical visual cues without explicit cross-layer interaction design. In this work, we propose a Memory-Augmented Adapter (Mema) within the vision encoder. Specifically, Mema maintains a stateful memory that accumulates hierarchical visual representations across layers, with its evolution conditioned on both query embeddings and step-wise visual features. A portion of this memory is selectively injected into token representations via a feedback mechanism, thereby mitigating the attenuation of fine-grained visual cues from shallow layers.…
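The mechanism described in the abstract (a memory that is updated layer by layer, conditioned on a query and the current visual features, with part of it fed back into the tokens) can be illustrated with a toy sketch. This is not the authors' implementation. The mean pooling, the scalar sigmoid gate, the additive injection, the `alpha` feedback fraction, and every function name below are assumptions chosen only to make the idea concrete in plain Python.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_pool(tokens):
    # Summarize one layer's token features as a single vector.
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]

def update_memory(memory, query, tokens):
    # Assumed gated update: the gate depends on how well the layer summary
    # matches the query, so memory evolution is conditioned on both.
    summary = mean_pool(tokens)
    gate = sigmoid(dot(query, summary) / math.sqrt(len(query)))
    return [(1.0 - gate) * m + gate * s for m, s in zip(memory, summary)]

def inject_memory(tokens, memory, alpha):
    # Assumed feedback: add a fraction `alpha` of the memory to every token.
    return [[x + alpha * m for x, m in zip(t, memory)] for t in tokens]

def run_adapter(layers, tokens, query, alpha=0.1):
    # `layers` stands in for frozen encoder blocks (hypothetical callables
    # mapping a list of token vectors to a list of token vectors).
    memory = [0.0] * len(query)
    for layer in layers:
        tokens = layer(tokens)                         # backbone step, unchanged
        memory = update_memory(memory, query, tokens)  # stateful accumulation
        tokens = inject_memory(tokens, memory, alpha)  # feedback into tokens
    return tokens, memory

if __name__ == "__main__":
    toy_layers = [lambda ts: ts, lambda ts: [[2 * x for x in t] for t in ts]]
    out, mem = run_adapter(toy_layers, [[1.0, 1.0], [3.0, 1.0]], [1.0, 0.0])
    print(out, mem)
```

Setting `alpha=0.0` in this sketch reduces it to the unmodified backbone, which mirrors the plug-and-play framing: the adapter only adds a residual signal and leaves the encoder blocks themselves untouched.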
