CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks
Yu Qi, Yumeng Zhang, Chenting Gong, Xiao Tan, Weiming Zhang, Wei Zhang, Jingdong Wang

TL;DR
This paper introduces CoT4Det, a framework that reformulates perception tasks into three interpretable steps (classification, counting, and grounding) to improve vision-language models' performance on detection, segmentation, and related tasks, narrowing the gap with task-specific models.
Contribution
The paper proposes a Chain-of-Thought approach for perception tasks that significantly improves LVLMs' accuracy on detection and other perception benchmarks.
Findings
Boosts mAP on COCO2017 val from 19% to 33%.
Outperforms baselines by 2% on RefCOCO.
Achieves a 19% improvement on Flickr30k Entities.
Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable success in a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks -- such as object detection, semantic segmentation, and depth estimation -- remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19% mAP on COCO2017 val, particularly struggling with dense scenes and small object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple but efficient strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding -- each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves…
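
The three steps named in the abstract (classification, counting, grounding) can be pictured as a chain of prompts in which each answer narrows the next question. The following is a minimal sketch of that control flow only, not the authors' implementation: the function `cot_detect`, the `ask` callable, the prompt wording, the JSON reply format, and the `[x1, y1, x2, y2]` box layout are all assumptions made for illustration, and `fake_lvlm` is a canned stand-in rather than a real model.

```python
import json
from typing import Callable, Dict, List

Box = List[float]  # assumed layout: [x1, y1, x2, y2] in pixels


def cot_detect(ask: Callable[[str], str], categories: List[str]) -> Dict[str, List[Box]]:
    """Hypothetical three-step chain: classification -> counting -> grounding.

    `ask` is a placeholder for a call to an LVLM that already holds the image;
    it takes a text prompt and returns the model's text reply.
    """
    # Step 1 (classification): which candidate categories appear in the image?
    reply = ask(
        "Which of these categories appear in the image? "
        f"Answer with a JSON list. Categories: {json.dumps(categories)}"
    )
    # Discard any name the model returns that was not in the candidate list.
    present = [c for c in json.loads(reply) if c in categories]

    detections: Dict[str, List[Box]] = {}
    for cat in present:
        # Step 2 (counting): how many instances of this category are there?
        count = int(ask(f"How many instances of '{cat}' are in the image? "
                        "Answer with one integer."))
        if count <= 0:
            continue
        # Step 3 (grounding): request boxes for the counted instances.
        reply = ask(f"Give the bounding boxes of the {count} '{cat}' instances "
                    "as a JSON list of [x1, y1, x2, y2].")
        boxes = json.loads(reply)
        detections[cat] = boxes[:count]  # keep at most the counted number
    return detections


def fake_lvlm(prompt: str) -> str:
    """Canned replies, used only to exercise the control flow above."""
    if prompt.startswith("Which"):
        return '["dog", "unicorn"]'
    if prompt.startswith("How many"):
        return "2"
    return "[[10, 20, 50, 80], [60, 25, 110, 90], [0, 0, 5, 5]]"


result = cot_detect(fake_lvlm, ["cat", "dog", "person"])
print(result)
```

With the canned replies, the out-of-vocabulary "unicorn" is dropped in step 1 and the third box is dropped in step 3 because the count was 2, so `result` holds two boxes under "dog". Using the count to cap the grounding output is one possible design choice in this sketch; the summary does not say how the paper reconciles a mismatch between counting and grounding.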
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
