From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning

Pengkun Jiao; Bin Zhu; Jingjing Chen; Chong-Wah Ngo; Yu-Gang Jiang

arXiv:2411.12787 · cs.CV · July 2, 2025

Open Access

TL;DR

This paper introduces Dual-LoRA, a dual-structured adapter framework, and VCE, a local feature aggregation module, to improve efficient visual instruction fine-tuning of multimodal models, especially on complex tasks.

Contribution

The paper proposes a dual-structured adapter (Dual-LoRA) and a local feature enhancement module (VCE) to better handle data conflicts in efficient visual instruction fine-tuning.

Findings

01. Dual-LoRA improves task adaptation with minimal additional inference time.

02. VCE enriches vision-language features with local details.

03. The approach outperforms existing methods on various benchmarks.

Abstract

Efficient Visual Instruction Fine-Tuning (EVIT) seeks to adapt Multimodal Large Language Models (MLLMs) to downstream tasks with minimal computational overhead. However, as task diversity and complexity increase, EVIT faces significant challenges in resolving data conflicts. To address this limitation, we propose the Dual Low-Rank Adaptation (Dual-LoRA), a holistic-to-local framework that enhances the adapter's capacity to address data conflict through dual structural optimization. Specifically, we utilize two subspaces: a skill space for stable, holistic knowledge retention, and a rank-rectified task space that locally activates the holistic knowledge. Additionally, we introduce Visual Cue Enhancement (VCE), a multi-level local feature aggregation module designed to enrich the vision-language projection with local details. Our approach is both memory- and time-efficient, requiring only…
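
As a rough illustration of the two-subspace idea described in the abstract, the sketch below builds a weight update from a low-rank "skill" matrix that is modulated element-wise by a non-linearly activated low-rank "task" matrix. This is only one possible reading of the abstract, not the authors' formulation: the element-wise product, the ReLU activation, the initialization, and every name (`DualLowRankAdapter`, `a_s`, `b_s`, `a_t`, `b_t`) are assumptions made for illustration.

```python
import numpy as np


class DualLowRankAdapter:
    """Hypothetical dual low-rank adapter around a frozen weight w0 (d_out x d_in)."""

    def __init__(self, w0, rank=2, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = w0.shape
        self.w0 = w0  # frozen pretrained weight, never updated
        # Skill space (assumed): b_s starts at zero, so the adapter is a no-op before training.
        self.a_s = rng.normal(scale=0.02, size=(rank, d_in))
        self.b_s = np.zeros((d_out, rank))
        # Task space (assumed): a second low-rank pair that gates the skill update.
        self.a_t = rng.normal(scale=0.02, size=(rank, d_in))
        self.b_t = rng.normal(scale=0.02, size=(d_out, rank))

    def delta(self):
        """Weight update: skill matrix gated element-wise by the activated task matrix."""
        skill = self.b_s @ self.a_s                  # rank at most `rank`
        task = np.maximum(self.b_t @ self.a_t, 0.0)  # ReLU is an assumed choice of activation
        return skill * task                          # element-wise product, shape (d_out, d_in)

    def forward(self, x):
        """Apply the adapted linear map to a batch x of shape (batch, d_in)."""
        return x @ (self.w0 + self.delta()).T


# Usage: an untrained adapter leaves the frozen layer's output unchanged.
w0 = np.random.default_rng(1).normal(size=(6, 5))
x = np.random.default_rng(2).normal(size=(3, 5))
adapter = DualLowRankAdapter(w0, rank=2)
y = adapter.forward(x)
```

Because the update is a single matrix of the same shape as the frozen weight, it could in principle be folded into `w0` after training, which is consistent with the listing's claim of minimal additional inference time; whether the paper does this is not stated in the truncated abstract.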

Taxonomy

Topics: Image Enhancement Techniques · Image and Video Quality Assessment · Advanced Vision and Imaging