CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks

Yu Qi; Yumeng Zhang; Chenting Gong; Xiao Tan; Weiming Zhang; Wei Zhang; Jingdong Wang

arXiv:2512.06663 · cs.CV · December 9, 2025

Open Access

TL;DR

This paper introduces CoT4Det, a framework that reformulates perception tasks into interpretable steps to enhance vision-language models' performance on detection, segmentation, and related tasks, bridging the gap with task-specific models.

Contribution

The paper proposes a Chain-of-Thought approach for perception tasks that improves LVLMs' accuracy on detection and other perception benchmarks.

Findings

01 Boosts COCO2017 mAP from 19% to 33%.

02 Outperforms baselines by +2% on RefCOCO.

03 Achieves 19% improvement on Flickr30k entities.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable success in a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks -- such as object detection, semantic segmentation, and depth estimation -- remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19% mAP on COCO2017 val, particularly struggling with dense scenes and small object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple but efficient strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding -- each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves…
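
The three steps named in the abstract (classification, counting, grounding) can be pictured as a staged prompting loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `ask` callable stands in for any LVLM call that already has the image attached, and the prompt wording, the `x1 y1 x2 y2` box format, and the helper names `parse_boxes` and `cot_detect` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def parse_boxes(reply: str) -> List[Box]:
    """Parse one 'x1 y1 x2 y2' box per line, skipping malformed lines."""
    boxes: List[Box] = []
    for line in reply.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue
        try:
            x1, y1, x2, y2 = (float(p) for p in parts)
        except ValueError:
            continue
        boxes.append((x1, y1, x2, y2))
    return boxes


def cot_detect(ask: Callable[[str], str], categories: List[str]) -> Dict[str, List[Box]]:
    """Staged prompting: classification, then counting, then grounding.

    `ask` is an assumed interface: it takes a text prompt and returns the
    model's text reply for a fixed image.
    """
    # Step 1 (classification): which candidate categories are present?
    reply = ask(
        "Which of these categories appear in the image? "
        + ", ".join(categories)
        + ". Answer with a comma-separated list."
    )
    named = {part.strip().lower() for part in reply.split(",")}
    present = [c for c in categories if c.lower() in named]

    detections: Dict[str, List[Box]] = {}
    for cat in present:
        # Step 2 (counting): how many instances of this category?
        count_reply = ask(
            f"How many instances of '{cat}' are in the image? Answer with one integer."
        )
        try:
            n = int(count_reply.strip())
        except ValueError:
            n = 0
        if n <= 0:
            continue

        # Step 3 (grounding): request exactly n boxes for this category.
        box_reply = ask(
            f"Give the bounding boxes of the {n} '{cat}' instances "
            "as x1 y1 x2 y2, one per line."
        )
        # Assumption of this sketch: keep at most n boxes so the grounding
        # output stays consistent with the count from step 2.
        detections[cat] = parse_boxes(box_reply)[:n]
    return detections
```

In this sketch the count from the second step bounds the number of boxes accepted in the third, which is one plausible way a counting step could constrain grounding in dense scenes; whether CoT4Det couples the steps this way is not stated in the abstract.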

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning