Empowering Reliable Visual-Centric Instruction Following in MLLMs

Weilei He; Feng Ju; Zhiyuan Fan; Rui Min; Minhao Cheng; Yi R. Fung

arXiv:2601.03198 · cs.LG · January 7, 2026

TL;DR

This paper introduces VC-IFEval, a benchmark and accompanying dataset for evaluating how well multimodal large language models (MLLMs) follow instructions whose constraints depend on visual content. It addresses a gap in existing benchmarks, which focus mainly on instructions expressed in text.

Contribution

The paper presents a benchmark and a systematically constructed dataset that evaluate MLLMs' instruction following in multimodal settings, incorporating vision-dependent constraints into instruction design for a more fine-grained assessment.

Findings

01. Fine-tuning improves instruction-following accuracy

02. Benchmark reveals strengths and limitations of current MLLMs

03. Systematic evaluation offers new insights into multimodal alignment

Abstract

Evaluating the instruction-following (IF) capabilities of Multimodal Large Language Models (MLLMs) is essential for rigorously assessing how faithfully model outputs adhere to user-specified intentions. However, existing IF benchmarks for MLLMs focus primarily on verbal instructions in the textual modality. This focus hinders a thorough analysis of instruction-following capabilities, because it overlooks the implicit constraints embedded in the semantically rich visual modality. To address this gap, we introduce VC-IFEval, a new benchmark accompanied by a systematically constructed dataset that evaluates MLLMs' instruction-following ability in multimodal settings. Our benchmark systematically incorporates vision-dependent constraints into instruction design, enabling a more rigorous and fine-grained assessment of how well MLLMs align…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Visual and Cognitive Learning Processes