Explain Before You Answer: A Survey on Compositional Visual Reasoning

Fucai Ke; Joy Hsu; Zhixi Cai; Zixian Ma; Xin Zheng; Xindi Wu; Sukai Huang; Weiqing Wang; Pari Delir Haghighi; Gholamreza Haffari; Ranjay Krishna; Jiajun Wu; Hamid Rezatofighi

arXiv:2508.17298·cs.CV·August 28, 2025

TL;DR

This survey comprehensively reviews recent advances in compositional visual reasoning, highlighting architectural paradigms, benchmarks, challenges, and future directions to guide ongoing research in multimodal AI.

Contribution

The survey provides the first systematic synthesis of the compositional visual reasoning literature from 2023 to 2025, including a taxonomy, a historical roadmap, and a critical analysis.

Findings

01. Identified a five-stage paradigm shift in architectural design.

02. Cataloged 60+ benchmarks and metrics for evaluation.

03. Highlighted open challenges such as hallucination and scalable supervision.

Abstract

Compositional visual reasoning has emerged as a key research frontier in multimodal AI, aiming to endow machines with the human-like ability to decompose visual scenes, ground intermediate concepts, and perform multi-step logical inference. While early surveys focus on monolithic vision-language models or general multimodal reasoning, a dedicated synthesis of the rapidly expanding compositional visual reasoning literature is still missing. We fill this gap with a comprehensive survey spanning 2023 to 2025 that systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We first formalize core definitions and describe why compositional approaches offer advantages in cognitive alignment, semantic fidelity, robustness, interpretability, and data efficiency. Next, we trace a five-stage paradigm shift: from prompt-enhanced language-centric pipelines, through…
