Siamese Natural Language Tracker: Tracking by Natural Language Descriptions with Siamese Trackers
Qi Feng, Vitaly Ablavsky, Qinxun Bai, Stan Sclaroff

TL;DR
This paper introduces the Siamese Natural Language Tracker (SNLT), which adapts Siamese visual trackers to follow objects specified by natural language descriptions. It improves the underlying trackers by 3 to 7 percentage points while still running in real time.
Contribution
The paper presents the Siamese Natural Language Region Proposal Network (SNL-RPN), which is combined with a Dynamic Aggregation of the vision and language modalities. The design applies to a wide range of Siamese trackers and establishes a new class of baselines for tracking by NL descriptions.
Findings
Improves Siamese trackers by 3 to 7 percentage points on tracking benchmarks with NL annotations, with a slight tradeoff of speed.
Outperforms all NL trackers to date.
Operates at 50 FPS on a single GPU.
Abstract
We propose a novel Siamese Natural Language Tracker (SNLT), which brings the advancements in visual tracking to the tracking by natural language (NL) descriptions task. The proposed SNLT is applicable to a wide range of Siamese trackers, providing a new class of baselines for the tracking by NL task and promising future improvements from the advancements of Siamese trackers. The carefully designed architecture of the Siamese Natural Language Region Proposal Network (SNL-RPN), together with the Dynamic Aggregation of vision and language modalities, is introduced to perform the tracking by NL task. Empirical results over tracking benchmarks with NL annotations show that the proposed SNLT improves Siamese trackers by 3 to 7 percentage points with a slight tradeoff of speed. The proposed SNLT outperforms all NL trackers to date and is competitive among state-of-the-art real-time trackers on…
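The idea of aggregating vision and language modalities in a Siamese tracker can be illustrated with a minimal sketch. It is not the authors' SNL-RPN. It assumes that a visual template and a language embedding are each reduced to a channel vector, that each is correlated with the search-region features to produce a response map, and that the two maps are blended with softmax weights. The function names, the 1x1 correlation, and the hand-supplied logits are all hypothetical; this summary does not describe how the paper computes its aggregation weights.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def correlate(kernel, feature_map):
    """1x1 cross-correlation: dot product of a C-dim kernel with every
    C-dim cell of an H x W feature map (assumed simplification)."""
    return [[sum(k * f for k, f in zip(kernel, cell)) for cell in row]
            for row in feature_map]


def aggregate(responses, logits):
    """Blend response maps with softmax weights (hypothetical stand-in
    for a learned, per-frame aggregation)."""
    weights = softmax(logits)
    height, width = len(responses[0]), len(responses[0][0])
    return [[sum(w * r[i][j] for w, r in zip(weights, responses))
             for j in range(width)]
            for i in range(height)]


def peak(response):
    """Return (row, col) of the first maximum in a response map."""
    best = max(max(row) for row in response)
    for i, row in enumerate(response):
        for j, value in enumerate(row):
            if value == best:
                return (i, j)


# Toy inputs: 2-channel features on a 2 x 2 search region.
visual_kernel = [1.0, 0.0]    # stands in for the visual template embedding
language_kernel = [0.0, 1.0]  # stands in for the NL description embedding
search_features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 0.0]],
]

visual_response = correlate(visual_kernel, search_features)
language_response = correlate(language_kernel, search_features)

# Equal logits weight both modalities equally; a larger language logit
# would shift the peak toward the language response.
fused = aggregate([visual_response, language_response], [0.0, 0.0])
print(fused, peak(fused))
```

A real Siamese tracker would use convolutional feature maps and a region proposal head rather than a single peak lookup. The sketch only shows where a language-derived kernel could enter the same correlation pipeline as the visual template.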
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
