How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing

Huanyu Zhang; Xuehai Bai; Chengzu Li; Chen Liang; Haochen Tian; Haodong Li; Ruichuan An; Yifan Zhang; Anna Korhonen; Zhang Zhang; Liang Wang; Tieniu Tan

arXiv:2602.01851 · cs.CV · May 22, 2026

TL;DR

VIBE is a benchmark for evaluating how well image editing models follow visual instructions such as sketches. It shows that proprietary models outperform open-source ones and that performance declines as task complexity increases.

Contribution

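The paper presents VIBE, a systematic benchmark for visual instruction-driven image editing. It combines a three-level interaction hierarchy (deictic grounding, morphological manipulation, and causal reasoning) with an LMM-as-a-judge evaluation framework that uses task-specific metrics.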

Findings

01 Proprietary models outperform open-source counterparts.

02 Model performance declines with increasing task complexity.

03 VIBE enables detailed assessment of visual instruction-following capabilities.

Abstract

Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing…
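The abstract describes per-level assessment across the three-level hierarchy. As a rough illustration only, the sketch below shows one way judge scores could be averaged per model and per level. The level names come from the abstract; the `JudgedCase` record, the `mean_score_by_level` helper, and the assumption that the judge returns a single score in [0, 1] are hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# Level names as listed in the abstract; everything else here is hypothetical.
LEVELS = ("deictic grounding", "morphological manipulation", "causal reasoning")


@dataclass
class JudgedCase:
    """One test case after scoring by a judge model (hypothetical record)."""
    model: str    # name of the image editing model under test
    level: str    # one of LEVELS
    score: float  # assumed judge score in [0, 1]


def mean_score_by_level(cases):
    """Return {model: {level: mean score}} for a list of JudgedCase records."""
    buckets = defaultdict(list)
    for case in cases:
        if case.level not in LEVELS:
            raise ValueError(f"unknown level: {case.level!r}")
        buckets[(case.model, case.level)].append(case.score)

    result = {}
    for (model, level), scores in buckets.items():
        result.setdefault(model, {})[level] = mean(scores)
    return result
```

Grouping by level rather than reporting a single overall number keeps a drop in performance at the harder levels visible, which is the kind of trend the findings describe.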

Code & Models

Repositories

hwanyu112/VIBE-Benchmark (GitHub)

Taxonomy

Topics: Multimodal Machine Learning Applications · Teaching and Learning Programming · Literacy, Media, and Education