From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning
Pengkun Jiao, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang

TL;DR
This paper introduces Dual-LoRA, a dual-structured adapter framework, and VCE, a local feature aggregation module, to improve efficient visual instruction fine-tuning of multimodal models, especially on complex tasks.
Contribution
The paper proposes a dual-structured adapter (Dual-LoRA) and a local feature enhancement module (VCE) to better handle data conflicts in efficient visual instruction fine-tuning.
Findings
Dual-LoRA improves task adaptation with minimal additional inference time.
VCE enriches vision-language features with local details.
The approach outperforms existing methods on various benchmarks.
Abstract
Efficient Visual Instruction Fine-Tuning (EVIT) seeks to adapt Multimodal Large Language Models (MLLMs) to downstream tasks with minimal computational overhead. However, as task diversity and complexity increase, EVIT faces significant challenges in resolving data conflicts. To address this limitation, we propose the Dual Low-Rank Adaptation (Dual-LoRA), a holistic-to-local framework that enhances the adapter's capacity to address data conflict through dual structural optimization. Specifically, we utilize two subspaces: a skill space for stable, holistic knowledge retention, and a rank-rectified task space that locally activates the holistic knowledge. Additionally, we introduce Visual Cue Enhancement (VCE), a multi-level local feature aggregation module designed to enrich the vision-language projection with local details. Our approach is both memory- and time-efficient, requiring only…
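The abstract describes two low-rank subspaces, with the task space "locally activating" the skill space, but it does not give the combination rule in the text available here. The following is a minimal sketch of one plausible reading, not the authors' implementation: the skill update is a standard LoRA product, and the task update is passed through a sigmoid and used as an element-wise gate. The sigmoid, the element-wise product, and all names (`matmul`, `dual_lora_delta`, `skill_b`, `skill_a`, `task_b`, `task_a`) are assumptions made for illustration.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def dual_lora_delta(skill_b, skill_a, task_b, task_a):
    """Hypothetical Dual-LoRA style weight update (assumed form, not from the source).

    skill_b @ skill_a : low-rank "skill" update holding holistic knowledge.
    task_b @ task_a   : low-rank "task" update; the sigmoid applied to it is an
                        assumed non-linearity, so the resulting gate is generally
                        no longer low-rank.
    The gate scales each entry of the skill update element-wise.
    """
    skill = matmul(skill_b, skill_a)
    task = matmul(task_b, task_a)
    return [
        [s / (1.0 + math.exp(-t)) for s, t in zip(skill_row, task_row)]
        for skill_row, task_row in zip(skill, task)
    ]


# Tiny example: a 2x3 update built from rank-1 factors.
skill_b = [[1.0], [2.0]]
skill_a = [[3.0, 4.0, 5.0]]
task_b = [[0.0], [0.0]]  # zero task factors give a uniform gate of 0.5
task_a = [[0.0, 0.0, 0.0]]

delta = dual_lora_delta(skill_b, skill_a, task_b, task_a)
```

Under this assumed form, the update can be computed once and added to the frozen weight matrix, which would be consistent with the summary's claim of minimal additional inference time.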
Taxonomy
Topics: Image Enhancement Techniques · Image and Video Quality Assessment · Advanced Vision and Imaging
