Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding

Ying Liu; Yudong Han; Kean Shi; Liyuan Pan

arXiv:2603.00655·cs.CV·April 15, 2026

TL;DR

Mema is a memory-augmented adapter that enhances vision-language understanding by hierarchically integrating visual cues across layers, improving multimodal reasoning with minimal additional training.

Contribution

Introduces a lightweight, plug-and-play memory module that captures hierarchical visual features and improves multimodal model performance without altering the backbone.

Findings

01 Mema improves performance across multiple benchmarks.

02 The memory mechanism effectively preserves fine-grained visual cues.

03 The approach requires minimal additional training parameters.

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable performance by aligning pretrained visual representations with the linguistic knowledge embedded in Large Language Models (LLMs). However, existing approaches typically rely on final-layer visual features or learnable multi-layer fusion, which often fail to sufficiently exploit hierarchical visual cues without explicit cross-layer interaction design. In this work, we propose a Memory-Augmented Adapter (Mema) within the vision encoder. Specifically, Mema maintains a stateful memory that accumulates hierarchical visual representations across layers, with its evolution conditioned on both query embeddings and step-wise visual features. A portion of this memory is selectively injected into token representations via a feedback mechanism, thereby mitigating the attenuation of fine-grained visual cues from shallow layers.…
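
The abstract describes the mechanism only in words, so the following is a minimal sketch of the general idea rather than the authors' implementation: a memory vector is updated after each encoder layer, the update is gated by both a query embedding and that layer's visual features, and a fraction of the memory is added back to the tokens. Every name here (`update_memory`, `inject`, `encode_with_memory`, `inject_ratio`), the mean-pooled layer summary, and the scalar sigmoid gate are assumptions chosen for illustration. Plain Python lists are used to keep the sketch dependency-free.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(tokens):
    # Average the token vectors of one layer into a single summary vector.
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_memory(memory, tokens, query):
    # Assumed update rule: a scalar gate in (0, 1), conditioned on the query
    # and on this layer's features, blends the old memory with the summary.
    summary = mean_pool(tokens)
    gate = sigmoid(dot(query, summary) / math.sqrt(len(query)))
    return [(1.0 - gate) * m + gate * s for m, s in zip(memory, summary)]

def inject(tokens, memory, inject_ratio):
    # Assumed feedback step: add a fixed fraction of the memory to every token.
    return [[t + inject_ratio * m for t, m in zip(tok, memory)] for tok in tokens]

def encode_with_memory(layers, tokens, query, inject_ratio=0.1):
    # `layers` is a list of callables standing in for frozen encoder layers.
    memory = [0.0] * len(query)
    for layer in layers:
        tokens = layer(tokens)
        memory = update_memory(memory, tokens, query)
        tokens = inject(tokens, memory, inject_ratio)
    return tokens, memory

# Toy usage: two stand-in "layers" that simply halve every feature.
halve = lambda toks: [[0.5 * x for x in tok] for tok in toks]
tokens_out, memory_out = encode_with_memory(
    [halve, halve], [[1.0, 0.0], [0.0, 1.0]], query=[1.0, 1.0]
)
```

In this toy setup the stand-in layers leave the backbone untouched and only the memory path adds computation, which mirrors the plug-and-play framing above. Setting `inject_ratio=0.0` recovers the plain layer stack, a convenient sanity check for any adapter of this kind.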

Code & Models

Repositories

Sisiliu312/Mema (GitHub)
