Towards General Continuous Memory for Vision-Language Models
Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang

TL;DR
This paper introduces CoMEM, an efficient continuous memory system for vision-language models. It improves complex reasoning by encoding multimodal and multilingual knowledge as dense embeddings, and the memory module adds only 1.2% more parameters.
Contribution
The paper proposes a compact, dense embedding-based memory system for VLMs, enabling efficient multimodal knowledge encoding without retraining the entire model.
Findings
Improved performance on eight multimodal reasoning benchmarks.
Memory module requires only 1.2% additional parameters.
Effective encoding of multimodal and multilingual knowledge.
Abstract
Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual real-world knowledge. To support such capabilities, an external memory system that can efficiently provide relevant multimodal information is essential. Existing approaches generally concatenate image and text tokens into a long sequence as memory, which can drastically increase context length and even degrade performance. In contrast, we propose continuous memory, a compact set of dense embeddings that represents multimodal and multilingual knowledge more effectively and efficiently. Our key insight is that a VLM can serve as its own continuous memory encoder. We empirically show that this design improves performance on complex multimodal…
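The contrast the abstract draws between concatenated tokens and a compact set of dense embeddings can be made concrete with a small sketch. Everything here is an assumption for illustration: the function names, the token counts, the number of memory slots, and the mean-pooling compressor, which is only a stand-in for a learned encoder and is not the CoMEM method itself.

```python
# Illustrative sketch only: sizes and the mean-pooling compressor are
# assumptions, not the encoder described in the paper.

def concat_context_length(query_len, item_token_counts):
    """Context length when every retrieved item is appended as raw tokens."""
    return query_len + sum(item_token_counts)


def memory_context_length(query_len, num_items, num_slots):
    """Context length when each item is reduced to `num_slots` embeddings."""
    return query_len + num_items * num_slots


def compress_item(token_embeddings, num_slots):
    """Mean-pool token embeddings into `num_slots` dense vectors.

    Hypothetical stand-in for a learned memory encoder: the tokens are
    split into `num_slots` contiguous chunks and each chunk is averaged.
    """
    n = len(token_embeddings)
    if n < num_slots:
        raise ValueError("need at least one token per memory slot")
    dim = len(token_embeddings[0])
    slots = []
    for s in range(num_slots):
        start = s * n // num_slots
        end = (s + 1) * n // num_slots
        chunk = token_embeddings[start:end]
        slots.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return slots


# Hypothetical numbers: a 32-token query and 5 retrieved items of 600 tokens.
long_ctx = concat_context_length(32, [600] * 5)
short_ctx = memory_context_length(32, num_items=5, num_slots=8)
print(long_ctx, short_ctx)
```

Under these assumed numbers the concatenated context grows with the total token count of the retrieved items, while the memory-based context grows only with the number of items times a fixed slot count.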
