Cantor: Inspiring Multimodal Chain-of-Thought of MLLM
Timin Gao, Peixian Chen, Mengdan Zhang, Chaoyou Fu, Yunhang Shen, Yan Zhang, Shengchuan Zhang, Xiawu Zheng, Xing Sun, Liujuan Cao, Rongrong Ji

TL;DR
Cantor is a multimodal chain-of-thought framework for multimodal large language models (MLLMs) that improves visual reasoning by integrating visual analysis with higher-level cognitive functions, yielding significant performance improvements without additional fine-tuning.
Contribution
The paper proposes a perception-decision architecture for multimodal CoT, enabling better visual reasoning in MLLMs through integrated visual analysis and higher-level cognitive functions.
Findings
Significant performance improvements on complex visual reasoning datasets.
Effective without fine-tuning or ground-truth rationales.
Demonstrates the importance of converging visual context and reasoning.
Abstract
With the advent of large language models (LLMs) enhanced by the chain-of-thought (CoT) methodology, visual reasoning problems are usually decomposed into manageable sub-tasks and tackled sequentially with various external tools. However, such a paradigm faces the challenge of potential "determining hallucinations" in decision-making, caused by insufficient visual information and by low-level perception tools that fail to provide the abstract summaries necessary for comprehensive reasoning. We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. This paper delves into the realm of multimodal CoT to solve intricate visual reasoning tasks with multimodal large language models (MLLMs) and their cognitive capability. To this end, we propose an innovative multimodal CoT framework, termed Cantor, characterized by a…
Taxonomy
Topics: Semantic Web and Ontologies · Speech and dialogue systems · Language, Metaphor, and Cognition
