iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding

Hanpeng Liu; Yaqian Li; Zidan Wang; Shuoxi Zhang; Zihao Bo; Rinyoichi Takezoe; Kaiwen Long; Kun He

arXiv:2603.02748 · cs.CV · March 10, 2026

Open Access

TL;DR

iGVLM is an instruction-guided visual encoding framework that dynamically modulates visual features according to the textual instruction, so that the visual representation adapts to the task instead of staying fixed across all questions about an image.

Contribution

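The paper presents a decoupled dual-branch architecture for instruction-aware visual reasoning in LVLMs: a frozen representation branch that keeps pre-trained visual features intact, and a dynamic conditioning branch that modulates those features according to the instruction.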

Findings

1. Improves instruction sensitivity across diverse language models.

2. Enhances logical consistency in multi-query, multi-instruction scenarios.

3. Maintains pre-trained visual priors while enabling task-specific adaptation.

Abstract

Despite the success of Large Vision-Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual representations are utilized in an invariant manner across different textual tasks. This rigidity hinders fine-grained reasoning where task-specific visual cues are critical. To address this issue, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural…
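The affine modulation mentioned in the abstract can be illustrated with a minimal sketch of AdaLN-style conditioning. This is not the paper's implementation: the function names (layer_norm, project, adaln_modulate), the toy dimensions, the weight values, and the zero-initialized baseline are all assumptions chosen for illustration. The sketch only shows the general idea that an instruction embedding is projected to a per-channel scale and shift, which are then applied to normalized visual tokens.

```python
import math

def layer_norm(token, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]

def project(vec, weight):
    """Multiply a length-m vector by an m x d weight matrix given as a list of rows."""
    d = len(weight[0])
    return [sum(vec[i] * weight[i][j] for i in range(len(vec))) for j in range(d)]

def adaln_modulate(visual_tokens, instruction_emb, w_scale, w_shift):
    """AdaLN-style modulation: out = LN(x) * (1 + scale(c)) + shift(c).

    scale and shift are linear projections of the instruction embedding c
    (a hypothetical parameterization, not taken from the paper).
    """
    scale = project(instruction_emb, w_scale)
    shift = project(instruction_emb, w_shift)
    out = []
    for token in visual_tokens:
        normed = layer_norm(token)
        out.append([n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)])
    return out

# Toy data: two visual tokens of width 4, two instruction embeddings of width 2.
tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 1.5, 2.5]]
instr_a = [1.0, 0.0]
instr_b = [0.0, 1.0]

# Assumed zero-initialized projections: modulation reduces to plain LayerNorm.
zero_w = [[0.0] * 4 for _ in range(2)]
# Assumed non-zero projections: each instruction yields its own scale and shift.
w_scale = [[0.5, 0.0, -0.5, 0.0], [0.0, 1.0, 0.0, -1.0]]
w_shift = [[0.1, 0.1, 0.1, 0.1], [-0.2, 0.0, 0.2, 0.0]]

baseline = adaln_modulate(tokens, instr_a, zero_w, zero_w)
out_a = adaln_modulate(tokens, instr_a, w_scale, w_shift)
out_b = adaln_modulate(tokens, instr_b, w_scale, w_shift)
```

In this sketch, zero projection weights leave the normalized features unchanged, which is one common way to let a conditioning path start from the unmodulated representation. With non-zero weights, the same visual tokens produce different features for instr_a and instr_b. How iGVLM actually parameterizes, initializes, and places its conditioning branch is not specified in the text above.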

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications