CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT
Chengyi Du, Yazhe Niu, Dazhong Shen, Luxin Xu

TL;DR
CoTZero introduces an annotation-free, hierarchical reasoning framework for vision-language models that synthesizes structured data and employs cognition-aligned training to improve human-like visual reasoning.
Contribution
It presents a novel dual-stage data synthesis and cognition-aligned training method that enhances VLMs' hierarchical reasoning without requiring annotations.
Findings
Achieves an 83.33% F1 score on a semantic inconsistency benchmark.
Improves interpretability and human-aligned reasoning in VLMs.
Each component significantly boosts reasoning coherence and generalization.
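For readers unfamiliar with the metric behind the first finding, F1 is the harmonic mean of precision and recall. The sketch below computes it from raw counts. The function name `f1_score` and the counts in the usage line are hypothetical illustrations, not values or code from the paper; they are chosen only because they happen to give 5/6, which rounds to 83.33%.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall, from raw counts."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only; NOT taken from the paper.
score = f1_score(tp=5, fp=1, fn=1)
print(f"{score:.4f}")  # prints 0.8333
```

Many different combinations of precision and recall produce the same F1, so the reported score alone does not reveal how errors split between false positives and false negatives.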
Abstract
Recent advances in vision-language models (VLMs) have markedly improved image-text alignment, yet these models still fall short of human-like visual reasoning. A key limitation is that many VLMs rely on surface correlations rather than building logically coherent structured representations. As a result, they often miss higher-level semantic structure and form non-causal relational understanding, which hinders compositional and verifiable reasoning. To address these limitations, we introduce human models into the reasoning process and propose CoTZero, an annotation-free paradigm with two components: (i) a dual-stage data synthesis approach and (ii) a cognition-aligned training method. In the first component, we draw inspiration from neurocognitive accounts of compositional productivity and global-to-local analysis. In the bottom-up stage, CoTZero extracts atomic visual primitives and incrementally…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
