MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

Haozhan Shen; Shilin Yan; Hongwei Xue; Shuaiqi Lu; Xiaojun Tang; Guannan Zhang; Tiancheng Zhao; Jianwei Yin

arXiv:2603.12266·cs.CV·March 13, 2026


TL;DR

This paper introduces MM-CondChain, a new benchmark for evaluating deep compositional reasoning in multimodal models, emphasizing multi-layer visual conditions that require detailed perception and complex reasoning.

Contribution

It presents a scalable agentic synthesis pipeline with verifiable layers to construct challenging, multi-domain benchmarks for visual reasoning tasks.

Findings

01. MLLMs achieve only 53.33 Path F1 on the benchmark.

02. Performance drops significantly on hard negatives and deeper chains.

03. Deep compositional reasoning remains a key challenge for current models.

Abstract

Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in…
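As an illustration of the chained conditionals described in the abstract, the following minimal sketch evaluates a multi-layer chain in order and stops at the first layer whose condition fails. It is not the benchmark's data format or evaluation code: `Scene`, `Layer`, `follow_chain`, the field names, and the second layer (an enabled Allow button) are all hypothetical, and the scene is a hand-written dictionary standing in for what a model would have to perceive from the image. Only the first condition (permission dialog plus green interface, then click Allow) comes from the abstract's example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for the visual facts a model perceives in a screenshot.
Scene = Dict[str, object]


@dataclass
class Layer:
    """One chain layer: a compositional condition plus an early-exit outcome."""
    description: str
    condition: Callable[[Scene], bool]
    on_false: str  # outcome reported when the condition fails and the chain stops


def follow_chain(layers: List[Layer], scene: Scene, final_outcome: str) -> Tuple[int, str]:
    """Evaluate layers in order and return (depth reached, outcome)."""
    for depth, layer in enumerate(layers, start=1):
        if not layer.condition(scene):
            # Early termination: later layers are never evaluated.
            return depth, layer.on_false
    return len(layers), final_outcome


layers = [
    Layer(
        description="a permission dialog appears and the interface is green",
        condition=lambda s: s["permission_dialog"] is True
        and s["interface_color"] == "green",
        on_false="stop: dialog condition not met",
    ),
    Layer(
        # Assumed second layer, added only to show chaining.
        description="the Allow button is enabled",
        condition=lambda s: s["allow_enabled"] is True,
        on_false="stop: Allow is not clickable",
    ),
]

matching = {"permission_dialog": True, "interface_color": "green", "allow_enabled": True}
# A near-miss scene: a single attribute differs, so the chain exits at layer 1.
near_miss = dict(matching, interface_color="blue")

print(follow_chain(layers, matching, "click Allow"))   # (2, 'click Allow')
print(follow_chain(layers, near_miss, "click Allow"))  # (1, 'stop: dialog condition not met')
```

The near-miss scene shows why a single misperceived attribute matters in a chain: the result changes both the outcome and the depth at which the process terminates.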


Code & Models

Datasets

Accio-Lab/MM-CondChain
dataset · 4.1k dl


Taxonomy

Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Explainable Artificial Intelligence (XAI)