Learning to Track Instance from Single Nature Language Description

Yaozong Zheng; Bineng Zhong; Qihua Liang; Shuimu Zeng; Haiying Xia; Shuxiang Song

arXiv:2605.07064 · cs.CV · May 11, 2026

TL;DR

This paper introduces a self-supervised vision-language (VL) tracker that learns to track objects from natural language descriptions without bounding-box annotations, using a novel Dynamic Token Aggregation Module.

Contribution

The paper proposes a new self-supervised VL tracking method with a Dynamic Token Aggregation Module that improves semantic alignment and tracking performance without bounding-box annotations.

Findings

01 The proposed tracker surpasses state-of-the-art self-supervised methods on VL tracking benchmarks.

02 The Dynamic Token Aggregation Module enhances semantic alignment between language and visual tokens.

03 The method enables effective instance tracking from unlabeled videos, without bounding-box annotations.

Abstract

How can vision-language (VL) tracking be achieved using natural language descriptions from a video sequence without relying on any bounding-box ground truth? In this work, we achieve this goal by tackling self-supervised VL tracking, which aims to evaluate tracking capabilities guided by natural language descriptions. We introduce a novel self-supervised VL tracker that is capable of tracking any object referred to by a language description. Unlike traditional methods that fuse all language and visual tokens equally, we propose an efficient Dynamic Token Aggregation Module, which treats each visual token unequally. The module consists of three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into…
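
Steps i) and ii) of the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract is truncated before the aggregation target and the third step are described, so the code stops after the merge. The function name `aggregate_tokens`, the scaled dot-product scoring, the top-k selection rule, and the attention-weighted average are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_tokens(anchor, template_tokens, k):
    """Select the k template tokens the anchor attends to most, then merge them.

    Hypothetical helper: the scoring, top-k rule, and weighted average are
    illustrative assumptions, not the paper's actual design.
    """
    dim = len(anchor)
    # Scaled dot-product attention scores between the anchor and each token.
    scores = [
        sum(a * t for a, t in zip(anchor, token)) / math.sqrt(dim)
        for token in template_tokens
    ]
    attn = softmax(scores)
    # Step i (assumed): keep the indices of the k highest-scoring tokens.
    selected = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)[:k]
    # Step ii (assumed): merge the selected tokens, weighted by their
    # attention scores renormalized over the selected set.
    weight_sum = sum(attn[i] for i in selected)
    merged = [
        sum(attn[i] * template_tokens[i][d] for i in selected) / weight_sum
        for d in range(dim)
    ]
    return selected, merged


# Toy 2-D example: four template tokens and one anchor token.
anchor = [1.0, 0.0]
template_tokens = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
selected, merged = aggregate_tokens(anchor, template_tokens, k=2)
```

In this toy case the anchor aligns most strongly with the first and third tokens, so only those two contribute to the merged token and the other two are ignored, which is one way of treating visual tokens unequally.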
