ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking

Xiao Wang; Liye Jin; Xufeng Lou; Shiao Wang; Lan Chen; Bo Jiang; Zhipeng Zhang

arXiv:2508.05221·cs.CV·August 8, 2025

TL;DR

ReasoningTrack is a reasoning-based framework that uses large pre-trained vision-language models to improve long-term vision-language tracking; the paper also introduces a new benchmark dataset, TNLLT, for evaluation.

Contribution

The paper presents a novel reasoning-based tracking framework and a large-scale benchmark dataset, enhancing the performance and understanding of vision-language tracking methods.

Findings

01. Significant performance improvements on multiple benchmarks.

02. Effective integration of reasoning and language generation.

03. Established a new dataset TNLLT for long-term tracking.

Abstract

Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse the fixed language with vision features or simply apply attention-based modifications; however, their performance is still limited. Recently, some researchers have explored using text generation to adapt to variations in the target during tracking. However, these works fail to provide insight into the model's reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address these issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, based on the pre-trained vision-language model Qwen2.5-VL. Both SFT (Supervised…
