ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model
Yiming Sun, Fan Yu, Shaoxiang Chen, Yu Zhang, Junwei Huang, Chenhui Li, Yang Li, Changbo Wang

TL;DR
ChatTracker leverages multimodal large language models and reflection-based prompt optimization to generate high-quality descriptions, significantly improving visual tracking performance and bridging the gap with state-of-the-art trackers.
Contribution
This paper introduces ChatTracker, a framework that uses a Multimodal Large Language Model (MLLM) to generate high-quality language descriptions of the target and refines them iteratively to improve visual tracking accuracy.
Findings
Achieves performance comparable to state-of-the-art trackers.
Refines ambiguous descriptions through a feedback loop.
Provides a plug-and-play module to boost existing trackers.
Abstract
Visual object tracking aims to locate a targeted object in a video sequence based on an initial bounding box. Recently, Vision-Language (VL) trackers have proposed to utilize additional natural language descriptions to enhance versatility in various applications. However, VL trackers are still inferior to State-of-The-Art (SoTA) visual trackers in terms of tracking performance. We found that this inferiority primarily results from their heavy reliance on manual textual annotations, which include the frequent provision of ambiguous language descriptions. In this paper, we propose ChatTracker to leverage the wealth of world knowledge in the Multimodal Large Language Model (MLLM) to generate high-quality language descriptions and enhance tracking performance. To this end, we propose a novel reflection-based prompt optimization module to iteratively refine the ambiguous and inaccurate…
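The abstract is truncated here, so the details of the reflection-based prompt optimization module are not available in this summary. As a rough illustration of what an iterative describe-check-refine loop of this kind could look like, the sketch below is a minimal toy version and not the authors' implementation. Everything in it is an assumption made for illustration: the callables `describe` (standing in for an MLLM) and `ground` (standing in for a text-to-box grounding model), the use of IoU against the initial bounding box as the feedback signal, and the parameters `max_rounds` and `iou_threshold`.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refine_description(
    describe: Callable[[str], str],   # hypothetical MLLM call: feedback -> description
    ground: Callable[[str], Box],     # hypothetical grounding call: description -> box
    target_box: Box,
    max_rounds: int = 3,              # assumed budget, not from the paper
    iou_threshold: float = 0.5,       # assumed acceptance criterion, not from the paper
) -> Tuple[str, float]:
    """Ask for a description, check how well it localizes the target,
    and feed the result back until it is good enough or rounds run out."""
    feedback = ""
    best_text, best_score = "", -1.0
    for _ in range(max_rounds):
        text = describe(feedback)
        score = iou(ground(text), target_box)
        if score > best_score:
            best_text, best_score = text, score
        if score >= iou_threshold:
            break
        feedback = (
            f"The description '{text}' overlapped the target with IoU "
            f"{score:.2f}; please be more specific."
        )
    return best_text, best_score


# Toy stand-ins so the loop can be run without any model.
def toy_describe(feedback: str) -> str:
    return "red car on the left" if feedback else "car"


def toy_ground(text: str) -> Box:
    return (0, 0, 10, 10) if "red" in text else (5, 5, 20, 20)


text, score = refine_description(toy_describe, toy_ground, (0, 0, 10, 10))
```

With the toy stand-ins, the first description ("car") localizes poorly, the feedback triggers a more specific second description, and the loop stops once the overlap passes the threshold. In a real system the two callables would wrap actual model calls, and the feedback text would need to be designed for the chosen MLLM.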
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · AI in Service Interactions
