LLaVA Steering: Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering

Jinhe Bi; Yujun Wang; Haokun Chen; Xun Xiao; Artur Hecker; Volker Tresp; Yunpu Ma

arXiv:2412.12359·cs.CV·January 8, 2025

TL;DR

This paper introduces MoReS, a linear transformation method that re-balances visual and textual modalities in multimodal models, drastically reducing trainable parameters while maintaining performance in visual tasks.

Contribution

The paper proposes MoReS, a novel linear representation-steering technique that significantly decreases the number of trainable parameters needed for visual instruction tuning.
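To make "linear representation steering" concrete, here is a minimal sketch in plain Python. It is not the paper's exact formulation: it assumes a generic low-rank edit of the form h' = h + Rᵀ(Wh + b − Rh), and it assumes the edit is applied only at visual-token positions while the base model stays frozen. The names `steer`, `steer_visual_tokens`, `R`, `W`, and `b` are hypothetical and chosen for illustration.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def steer(h, R, W, b):
    """Apply a low-rank linear edit to one hidden vector h.

    Assumed form (illustrative, not taken from the paper):
        h' = h + R^T (W h + b - R h)
    R and W are r x d matrices and b has length r, so the only
    trainable parameters are R, W, and b.
    """
    Rh = matvec(R, h)   # current coordinates in the r-dim subspace
    Wh = matvec(W, h)   # linear part of the target coordinates
    delta = [wh + bi - rh for wh, bi, rh in zip(Wh, b, Rh)]
    # Map the r-dim correction back to the d-dim hidden space (R^T delta).
    back = [sum(R[k][j] * delta[k] for k in range(len(R)))
            for j in range(len(h))]
    return [hj + bj for hj, bj in zip(h, back)]


def steer_visual_tokens(hidden, is_visual, R, W, b):
    """Steer only visual-token positions; text tokens pass through unchanged."""
    return [steer(h, R, W, b) if vis else list(h)
            for h, vis in zip(hidden, is_visual)]


# Toy example: d = 4, r = 1. The subspace is the first coordinate,
# and the edit moves that coordinate to the target value b = 2.
R = [[1.0, 0.0, 0.0, 0.0]]
W = [[0.0, 0.0, 0.0, 0.0]]
b = [2.0]

hidden = [[1.0, 1.0, 1.0, 1.0],   # a visual token
          [1.0, 1.0, 1.0, 1.0]]   # a text token
steered = steer_visual_tokens(hidden, [True, False], R, W, b)
```

In this toy setup only the first coordinate of the visual token changes, and the text token is returned untouched. With rank r much smaller than the hidden size d, the edit costs roughly 2dr + r parameters per layer.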

Findings

01  MoReS requires 500x fewer trainable parameters than LoRA.

02  LLaVA Steering models achieve performance comparable to LoRA with far fewer trainable parameters.

03  The accompanying LLaVA Steering Factory platform enables quick customization and evaluation of multimodal models.
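The parameter-count comparison behind the first finding can be illustrated with simple arithmetic. The sketch below counts trainable parameters for LoRA and for a per-layer low-rank steering module; every number in it (hidden size, layer count, ranks, number of adapted matrices) is a hypothetical configuration, not the paper's setup, so the resulting ratio is not the reported 500x figure.

```python
def lora_param_count(d_model, rank, n_matrices, n_layers):
    """LoRA on square d x d weights: each adapted matrix adds
    A (rank x d) and B (d x rank), i.e. 2 * d * rank parameters."""
    return 2 * d_model * rank * n_matrices * n_layers


def steering_param_count(d_model, rank, n_layers):
    """One assumed steering module per layer: two rank x d
    projections plus a rank-sized bias."""
    return (2 * d_model * rank + rank) * n_layers


# Hypothetical configuration, for illustration only.
lora_count = lora_param_count(d_model=4096, rank=16, n_matrices=4, n_layers=32)
steer_count = steering_param_count(d_model=4096, rank=1, n_layers=32)
ratio = lora_count / steer_count

print(f"LoRA: {lora_count:,}  steering: {steer_count:,}  ratio: {ratio:.1f}x")
```

The ratio grows with the LoRA rank and the number of adapted matrices, and shrinks as the steering rank increases, so the exact reduction depends on the configurations being compared.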

Abstract

Multimodal Large Language Models (MLLMs) have significantly advanced visual tasks by integrating visual representations into large language models (LLMs). The textual modality, inherited from LLMs, equips MLLMs with abilities like instruction following and in-context learning. In contrast, the visual modality enhances performance in downstream tasks by leveraging rich semantic content, spatial information, and grounding capabilities. These intrinsic modalities work synergistically across various visual tasks. Our research initially reveals a persistent imbalance between these modalities, with text often dominating output generation during visual instruction tuning. This imbalance occurs when using both full fine-tuning and parameter-efficient fine-tuning (PEFT) methods. We then found that re-balancing these modalities can significantly reduce the number of trainable parameters required,…

Code & Models

Repositories

bibisbar/LLaVA-Steering (PyTorch, official)

Taxonomy

Topics: Advanced Vision and Imaging · Hand Gesture Recognition Systems · Gaze Tracking and Assistive Technology