CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT

Chengyi Du; Yazhe Niu; Dazhong Shen; Luxin Xu

arXiv:2602.08339·cs.AI·February 10, 2026

TL;DR

CoTZero introduces an annotation-free, hierarchical reasoning framework for vision-language models that synthesizes structured data and employs cognition-aligned training to improve human-like visual reasoning.

Contribution

The paper presents a dual-stage data synthesis approach and a cognition-aligned training method that together enhance VLMs' hierarchical reasoning without requiring annotations.

Findings

01. Achieves an F1 score of 83.33% on a semantic inconsistency benchmark.

02. Improves interpretability and human-aligned reasoning in VLMs.

03. Each component significantly boosts reasoning coherence and generalization.

Abstract

Recent advances in vision-language models (VLMs) have markedly improved image-text alignment, yet they still fall short of human-like visual reasoning. A key limitation is that many VLMs rely on surface correlations rather than building logically coherent structured representations, which often leads to missed higher-level semantic structure and non-causal relational understanding, hindering compositional and verifiable reasoning. To address these limitations by introducing human models into the reasoning process, we propose CoTZero, an annotation-free paradigm with two components: (i) a dual-stage data synthesis approach and (ii) a cognition-aligned training method. In the first component, we draw inspiration from neurocognitive accounts of compositional productivity and global-to-local analysis. In the bottom-up stage, CoTZero extracts atomic visual primitives and incrementally…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning