Selective State Space Memory for Large Vision-Language Models

Chee Ng; Yuen Fung

arXiv:2412.09875·cs.CV·December 16, 2024

TL;DR

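This paper presents State Space Memory Integration (SSMI), a method for fine-tuning large vision-language models (LVLMs) that integrates lightweight Mamba-based state space modules and updates only a fraction of the model's parameters. It reports state-of-the-art results on benchmark datasets including COCO Captioning, VQA, and Flickr30k at reduced computational cost.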

Contribution

Introduction of SSMI, a novel state space memory approach that enables efficient, scalable fine-tuning of LVLMs with minimal parameter updates.

Findings

01. Achieves state-of-the-art performance on benchmark datasets including COCO Captioning, VQA, and Flickr30k.

02. Updates only a fraction of the model's parameters during fine-tuning, unlike traditional fine-tuning methods.

03. Demonstrates improved efficiency, robustness, and interpretability.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across a wide range of multimodal tasks. However, fine-tuning these models for domain-specific applications remains a computationally intensive challenge. This paper introduces State Space Memory Integration (SSMI), a novel approach for efficient fine-tuning of LVLMs. By integrating lightweight Mamba-based state space modules into the LVLM architecture, SSMI captures long-range dependencies and injects task-specific visual and sequential patterns effectively. Unlike traditional fine-tuning methods, SSMI requires only a fraction of the model's parameters to be updated, making it computationally efficient and scalable. Experiments on benchmark datasets, including COCO Captioning, VQA, and Flickr30k, demonstrate that SSMI achieves state-of-the-art performance while maintaining robustness and generalization…
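
To make the idea in the abstract concrete, here is a minimal sketch of a selective (input-dependent) state space recurrence used as a residual adapter next to frozen activations. It is not the authors' implementation: the abstract does not specify the module design, so every name here (`selective_scan`, `adapter_forward`, `trainable_fraction`, `w_delta`, `w_b`, `w_c`) is hypothetical, the recurrence is a simplified single-channel variant of the Mamba-style update, and the backbone size of 1000 parameters is an arbitrary illustrative number.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)); keeps the step size positive.
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def selective_scan(xs, a, w_delta, w_b, w_c):
    """Diagonal selective state space recurrence over one scalar channel.

    All parameter names are hypothetical. `a` holds negative decay rates;
    the step size and the input/output projections depend on the input x_t.
    """
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)        # input-dependent step size
        for i in range(len(a)):
            decay = math.exp(delta * a[i])   # in (0, 1) because a[i] < 0
            h[i] = decay * h[i] + delta * (w_b[i] * x) * x
        ys.append(sum((w_c[i] * x) * h[i] for i in range(len(a))))
    return ys


def adapter_forward(frozen_hidden, a, w_delta, w_b, w_c):
    # Residual injection: the frozen activations pass through unchanged,
    # and only the small state space branch would be trained.
    ys = selective_scan(frozen_hidden, a, w_delta, w_b, w_c)
    return [x + y for x, y in zip(frozen_hidden, ys)]


def trainable_fraction(adapter_params, backbone_params):
    # Share of all parameters that would receive gradient updates.
    return adapter_params / (adapter_params + backbone_params)


hidden = [0.5, -1.0, 2.0, 0.25]              # stand-in for frozen LVLM activations
a = [-1.0, -0.5]                             # assumed decay rates (state size 2)
out = adapter_forward(hidden, a, w_delta=1.0, w_b=[1.0, 1.0], w_c=[0.0, 0.0])
print(out)
# Adapter parameters in this sketch: a, w_b, w_c (state size each) plus w_delta.
print(trainable_fraction(3 * len(a) + 1, 1000))
```

With the output projection `w_c` initialized to zero, the adapter leaves the frozen activations unchanged, which is a common way to start adapter training from the pretrained model's behavior. Whether SSMI uses this initialization is an assumption of the sketch.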
