ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model

Yiming Sun; Fan Yu; Shaoxiang Chen; Yu Zhang; Junwei Huang; Chenhui Li; Yang Li; Changbo Wang

arXiv:2411.01756 · cs.CV · December 17, 2024 · 3 cites

TL;DR

ChatTracker leverages multimodal large language models and reflection-based prompt optimization to generate high-quality descriptions, significantly improving visual tracking performance and bridging the gap with state-of-the-art trackers.

Contribution

This paper introduces ChatTracker, a framework that uses a Multimodal Large Language Model (MLLM) to generate high-quality language descriptions of the target and refines them iteratively to improve visual tracking accuracy.

Findings

01. Achieves performance comparable to state-of-the-art trackers.

02. Effectively refines ambiguous descriptions through a feedback loop.

03. Provides a plug-and-play module to boost existing trackers.

Abstract

Visual object tracking aims to locate a target object in a video sequence given an initial bounding box. Recently, Vision-Language (VL) trackers have been proposed that use additional natural language descriptions to improve versatility across applications. However, VL trackers still fall short of State-of-The-Art (SoTA) visual trackers in tracking performance. We found that this gap results primarily from their heavy reliance on manual textual annotations, which frequently contain ambiguous language descriptions. In this paper, we propose ChatTracker, which leverages the wealth of world knowledge in the Multimodal Large Language Model (MLLM) to generate high-quality language descriptions and enhance tracking performance. To this end, we propose a novel reflection-based prompt optimization module to iteratively refine the ambiguous and inaccurate…
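The abstract is truncated before it specifies the refinement procedure, so the following is only a generic sketch of what an iterative, feedback-driven description loop can look like, not the authors' implementation. Every name here is an assumption made for illustration: `propose` stands in for an MLLM call that returns a candidate description given textual feedback, `ground` stands in for any model that maps a description to a bounding box, and the IoU score and the `accept_iou` threshold are choices of this sketch rather than details taken from the paper.

```python
from typing import Callable, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical convention


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refine_description(
    propose: Callable[[Optional[str]], str],  # assumed stand-in for an MLLM
    ground: Callable[[str], Box],             # assumed stand-in for a grounding model
    target: Box,
    max_rounds: int = 5,
    accept_iou: float = 0.5,                  # illustrative threshold, not from the paper
) -> Tuple[str, float]:
    """Ask for a description, score it against the target box, and feed the
    score back as text until it is good enough or the round budget runs out."""
    best_text, best_score = "", -1.0
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        text = propose(feedback)
        score = iou(ground(text), target)
        if score > best_score:
            best_text, best_score = text, score
        if score >= accept_iou:
            break
        # The feedback wording is invented for this sketch.
        feedback = (
            f"The description '{text}' overlapped the target with "
            f"IoU {score:.2f}; please be more specific."
        )
    return best_text, best_score
```

With toy stand-ins, for example a `propose` that walks through a fixed list of candidate phrases and a `ground` that looks boxes up in a dictionary, the loop returns the first candidate whose box overlaps the target well enough, or the best one seen if none does.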

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · AI in Service Interactions