ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations

Xuecheng Wu; Jiaxing Liu; Danlei Huang; Yifan Wang; Yunyun Shi; Kedi Chen; Junxiao Xue; Yang Liu; Chunlin Chen; Hairong Dong; Dingkang Yang

arXiv:2505.14404·cs.CV·December 30, 2025



TL;DR

This paper introduces ViC-Bench, a new benchmark for evaluating how multi-modal large language models reason with free-style intermediate visual states, showing how their capabilities vary and which prompting factors influence performance.

Contribution

The paper presents ViC-Bench, a comprehensive benchmark with a novel evaluation suite and metrics for assessing VI-CoT in MLLMs using free-style IVS, addressing limitations of fixed IVS benchmarks.

Findings

01 Evaluated 18 advanced MLLMs, revealing varied VI-CoT capabilities.
02 Identified key prompting factors influencing reasoning performance.
03 Provided insights into how free-style IVS impacts model reasoning.

Abstract

Visual-Interleaved Chain-of-Thought (VI-CoT) enables Multi-modal Large Language Models (MLLMs) to continually update their understanding and decision space based on step-wise intermediate visual states (IVS), much like a human would. It has demonstrated impressive success in various tasks, leading to advances in related downstream benchmarks. Despite promising progress, current benchmarks provide models with relatively fixed IVS rather than free-style IVS, which might forcibly distort the original thinking trajectories and thus fail to evaluate their intrinsic reasoning capabilities. More importantly, existing benchmarks neglect to systematically explore the factors through which IVS affect untamed reasoning performance. To address these gaps, we introduce a specialized benchmark termed ViC-Bench, consisting of four representative tasks, i.e., maze navigation,…


Code & Models

Datasets

meituan-longcat/ViC-Bench
dataset · 173 dl


Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling

Methods: Jigsaw