MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

Wayner Barrios; Andrés Villa; Juan León Alcázar; SouYoung Jin; Bernard Ghanem

arXiv:2506.01850 · cs.CV · June 3, 2025

TL;DR

MoDA introduces a lightweight modulation adapter that enhances fine-grained visual grounding in instruction-following multimodal large language models by refining visual features with instruction-guided attention.

Contribution

The paper presents MoDA, a novel Transformer-based modulation adapter that improves visual grounding in MLLMs during instruction tuning.

Findings

01. MoDA significantly improves visual grounding accuracy.

02. MoDA enhances contextual response generation in MLLMs.

03. MoDA integrates effectively with the existing LLaVA training protocol.

Abstract

Recently, Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle to ground fine-grained visual concepts in complex scenes. In this paper, we propose MoDA (Modulation Adapter), a lightweight yet effective module designed to refine pre-aligned visual features through instruction-guided modulation. Our approach follows the standard LLaVA training protocol, which consists of two stages: (1) aligning image features to the LLM's input space via a frozen vision encoder and adapter layers, and (2) refining those features using the MoDA adapter during the instruction-tuning stage. MoDA employs a Transformer-based cross-attention mechanism to generate a modulation mask over the aligned visual tokens, thereby…
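As a rough illustration of the mechanism the abstract describes, the following minimal sketch computes a cross-attention-based mask over visual tokens and applies it to them. It is not the authors' implementation. Several details are assumptions made only for this example: the visual tokens act as queries and the instruction tokens as keys and values, the mask is squashed with a sigmoid, it has the same shape as the visual tokens, and it is applied by element-wise multiplication. The function name `modulate`, the weight matrices `w_q`, `w_k`, `w_v`, and all dimensions are hypothetical.

```python
import math
import random


def matmul(a, b):
    # Plain matrix product of two list-of-lists matrices.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def modulate(visual, instruction, w_q, w_k, w_v):
    """Hypothetical instruction-guided modulation.

    visual:      [N_v][d] aligned visual tokens (queries, by assumption)
    instruction: [N_t][d] instruction token embeddings (keys and values)
    Returns (mask, modulated), both of shape [N_v][d].
    """
    d = len(w_q[0])
    q = matmul(visual, w_q)
    k = matmul(instruction, w_k)
    v = matmul(instruction, w_v)
    # Scaled dot-product cross-attention: each visual token attends to the instruction.
    scores = matmul(q, transpose(k))
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    context = matmul(attn, v)
    # Assumption: a sigmoid turns the attended context into a (0, 1) mask.
    mask = [[sigmoid(c) for c in row] for row in context]
    # Assumption: the mask rescales each visual token element-wise.
    modulated = [[m * x for m, x in zip(m_row, x_row)]
                 for m_row, x_row in zip(mask, visual)]
    return mask, modulated


rng = random.Random(0)
d = 8  # toy embedding width
visual = random_matrix(4, d, rng, scale=1.0)       # 4 visual tokens
instruction = random_matrix(3, d, rng, scale=1.0)  # 3 instruction tokens
w_q, w_k, w_v = (random_matrix(d, d, rng) for _ in range(3))
mask, modulated = modulate(visual, instruction, w_q, w_k, w_v)
```

In this toy setup the mask is recomputed for every instruction, so the same image features would be rescaled differently depending on the instruction. A real adapter would learn the projection weights during instruction tuning rather than sample them randomly.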

Taxonomy

Topics: Online Learning and Analytics · Intelligent Tutoring Systems and Adaptive Learning

Methods: Adapter