LLaVA Steering: Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering
Jinhe Bi, Yujun Wang, Haokun Chen, Xun Xiao, Artur Hecker, Volker Tresp, Yunpu Ma

TL;DR
This paper introduces MoReS, a linear transformation method that re-balances visual and textual modalities in multimodal models, drastically reducing trainable parameters while maintaining performance in visual tasks.
Contribution
The paper proposes MoReS, a novel linear representation-steering technique that significantly decreases the number of trainable parameters needed for visual instruction tuning.
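To make the idea of linearly steering visual representations concrete, here is a minimal sketch. It assumes a low-rank linear update added only at visual token positions, with all other weights frozen. The function names, shapes, rank, and the low-rank form itself are illustrative assumptions and are not taken from the paper's exact formulation.

```python
import numpy as np

def steer_visual_tokens(hidden, visual_mask, down, up):
    """Add a low-rank linear update to visual-token rows only.

    hidden:      (seq_len, d_model) hidden states at one layer
    visual_mask: (seq_len,) boolean, True where the token is visual
    down:        (d_model, r) projection into a small subspace (hypothetical)
    up:          (r, d_model) projection back to model width (hypothetical)
    """
    steered = hidden.copy()
    visual = hidden[visual_mask]
    # Text rows are left untouched; only visual rows receive the update.
    steered[visual_mask] = visual + visual @ down @ up
    return steered

def count_params(*arrays):
    """Total number of trainable scalars across the given arrays."""
    return sum(a.size for a in arrays)

# Toy dimensions chosen for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, r = 6, 16, 1
hidden = rng.normal(size=(seq_len, d_model))
visual_mask = np.array([True, True, True, False, False, False])

down = rng.normal(size=(d_model, r)) * 0.1
up = np.zeros((r, d_model))  # zero init: steering starts as the identity map

out = steer_visual_tokens(hidden, visual_mask, down, up)
n_trainable = count_params(down, up)  # 2 * d_model * r scalars per layer
```

In this sketch the trainable parameter count per layer grows with `2 * d_model * r`, which is why a small rank applied only to visual tokens can be far cheaper than adapting every attention projection.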
Findings
MoReS reduces trainable parameters by 500x compared to LoRA.
LLaVA Steering models achieve performance comparable to LoRA-tuned models with far fewer trainable parameters.
An accompanying platform enables quick customization and evaluation of multimodal models.
Abstract
Multimodal Large Language Models (MLLMs) have significantly advanced visual tasks by integrating visual representations into large language models (LLMs). The textual modality, inherited from LLMs, equips MLLMs with abilities like instruction following and in-context learning. In contrast, the visual modality enhances performance in downstream tasks by leveraging rich semantic content, spatial information, and grounding capabilities. These intrinsic modalities work synergistically across various visual tasks. Our research initially reveals a persistent imbalance between these modalities, with text often dominating output generation during visual instruction tuning. This imbalance occurs when using both full fine-tuning and parameter-efficient fine-tuning (PEFT) methods. We then found that re-balancing these modalities can significantly reduce the number of trainable parameters required,…
Taxonomy
Topics: Advanced Vision and Imaging · Hand Gesture Recognition Systems · Gaze Tracking and Assistive Technology
