ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking
Xiao Wang, Liye Jin, Xufeng Lou, Shiao Wang, Lan Chen, Bo Jiang, Zhipeng Zhang

TL;DR
ReasoningTrack is a reasoning-based framework that uses a large pre-trained vision-language model to improve long-term vision-language tracking. The paper also introduces a new benchmark dataset, TNLLT, for evaluation.
Contribution
The paper presents a novel reasoning-based tracking framework and a large-scale benchmark dataset, enhancing the performance and understanding of vision-language tracking methods.
Findings
Significant performance improvements on multiple benchmarks.
Effective integration of reasoning and language generation.
Establishes a new dataset, TNLLT, for long-term vision-language tracking.
Abstract
Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse the fixed language with vision features or simply modify using attention; however, their performance is still limited. Recently, some researchers have explored using text generation to adapt to variations in the target during tracking. However, these works fail to provide insights into the model's reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address these issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, based on the pre-trained vision-language model Qwen2.5-VL. Both SFT (Supervised…
