Selective State Space Memory for Large Vision-Language Models
Chee Ng, Yuen Fung

TL;DR
This paper presents SSMI, a lightweight method for fine-tuning large vision-language models that integrates Mamba-based state space modules and updates only a fraction of the model's parameters. The authors report state-of-the-art results on COCO Captioning, VQA, and Flickr30k at reduced computational cost.
Contribution
Introduces State Space Memory Integration (SSMI), a state space memory approach that enables efficient, scalable fine-tuning of large vision-language models (LVLMs) while updating only a small fraction of their parameters.
Findings
Achieves state-of-the-art performance on benchmark datasets, including COCO Captioning, VQA, and Flickr30k.
Updates only a fraction of the model's parameters during fine-tuning, unlike traditional fine-tuning methods.
Demonstrates improved efficiency, robustness, and interpretability.
Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across a wide range of multimodal tasks. However, fine-tuning these models for domain-specific applications remains a computationally intensive challenge. This paper introduces State Space Memory Integration (SSMI), a novel approach for efficient fine-tuning of LVLMs. By integrating lightweight Mamba-based state space modules into the LVLM architecture, SSMI captures long-range dependencies and injects task-specific visual and sequential patterns effectively. Unlike traditional fine-tuning methods, SSMI requires only a fraction of the model's parameters to be updated, making it computationally efficient and scalable. Experiments on benchmark datasets, including COCO Captioning, VQA, and Flickr30k, demonstrate that SSMI achieves state-of-the-art performance while maintaining robustness and generalization…
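To make the idea in the abstract concrete, here is a minimal sketch of a selective (Mamba-style) state space module used as a residual adapter on top of frozen token features. It is not the authors' implementation. The class name SelectiveSSMAdapter, the dimensions, the zero-initialized gate, and the simplified discretization (decay exp(delta * a), input scaled by delta * B) are all assumptions chosen for illustration, and the code uses plain Python lists rather than a deep learning framework so that the recurrence is easy to read.

```python
import math
import random


def softplus(z):
    """Numerically stable log(1 + exp(z)); keeps the step size positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class SelectiveSSMAdapter:
    """Hypothetical residual adapter: y_t = x_t + gate * SSM(x)_t.

    Assumption: the backbone that produces x_t stays frozen, and only the
    small set of weights held by this object would be trained.
    """

    def __init__(self, dim, state, seed=0):
        rng = random.Random(seed)

        def small(n):
            return [rng.uniform(-0.1, 0.1) for _ in range(n)]

        self.dim, self.state = dim, state
        # Diagonal state matrix; negative entries make the state decay.
        self.a = [-(n + 1.0) for n in range(state)]
        self.w_delta = [small(dim) for _ in range(dim)]  # x_t -> per-channel step
        self.w_b = [small(dim) for _ in range(state)]    # x_t -> input matrix B_t
        self.w_c = [small(dim) for _ in range(state)]    # x_t -> output matrix C_t
        self.gate = 0.0  # zero-init (assumption): adapter starts as the identity

    def num_params(self):
        return self.state + self.dim * self.dim + 2 * self.state * self.dim + 1

    def __call__(self, xs):
        # One hidden state vector of size `state` per feature channel.
        h = [[0.0] * self.state for _ in range(self.dim)]
        ys = []
        for x in xs:
            # "Selective": step size, B and C all depend on the current token.
            delta = [softplus(dot(w, x)) for w in self.w_delta]
            b = [dot(w, x) for w in self.w_b]
            c = [dot(w, x) for w in self.w_c]
            y = []
            for d in range(self.dim):
                for n in range(self.state):
                    decay = math.exp(delta[d] * self.a[n])
                    h[d][n] = decay * h[d][n] + delta[d] * b[n] * x[d]
                y.append(x[d] + self.gate * dot(c, h[d]))
            ys.append(y)
        return ys


# Toy "frozen backbone" features: 5 tokens, 4 channels each.
tokens = [[0.1 * (t + d) for d in range(4)] for t in range(5)]
adapter = SelectiveSSMAdapter(dim=4, state=2)
untrained_out = adapter(tokens)  # gate is 0, so the features pass through

adapter.gate = 0.5               # stand-in for a gate value learned in training
adapted_out = adapter(tokens)
```

With the gate at zero the adapter leaves the backbone's features unchanged, so training can begin from the pretrained model's behavior. The adapter's own weight count, given by num_params(), grows with the feature width and state size rather than with the depth of the backbone, which is one way a method of this kind can keep the updated fraction of parameters small.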
