ATSTrack: Enhancing Visual-Language Tracking by Aligning Temporal and Spatial Scales

Yihao Zhen; Qiang Wang; Yu Qiao; Liangqiong Qu; Huijie Fan

arXiv:2507.00454·cs.CV·July 2, 2025

TL;DR

ATSTrack is a novel visual-language tracking method that aligns temporal and spatial scales of inputs, decomposes language descriptions into attribute-based phrases, and uses a Visual-Language token to improve feature relevance, addressing scale mismatch issues.

Contribution

The paper introduces ATSTrack, a new tracker that explicitly aligns temporal and spatial scales of visual and language inputs, enhancing feature modification and tracking accuracy.

Findings

01. Achieves performance comparable to existing methods.

02. Effectively aligns temporal and spatial scales of inputs.

03. Reduces impact of scale differences on tracking accuracy.

Abstract

A main challenge of Visual-Language Tracking (VLT) is the misalignment between visual inputs and language descriptions caused by target movement. Previous trackers have explored many effective feature modification methods to preserve more aligned features. However, an important yet unexplored factor ultimately hinders their capability: the inherent difference in the temporal and spatial scales of information between visual and language inputs. To address this issue, we propose a novel visual-language tracker, named ATSTrack, that enhances the effect of feature modification by Aligning the Temporal and Spatial scales of different input components. Specifically, we decompose each language description into phrases with different attributes based on their temporal and spatial correspondence with visual inputs, and modify their features in a…
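The decomposition step mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' method: the reviews below indicate that the paper uses an LLM and splits the text into four parts, whereas the sketch substitutes a simple keyword lookup. The attribute names (`category`, `appearance`, `action`, `location`), the keyword sets, and the `decompose` function are all hypothetical and chosen only for illustration.

```python
# Toy keyword-based stand-in for attribute decomposition.
# Attribute names and keyword sets are hypothetical, not taken from the paper.
ATTRIBUTE_KEYWORDS = {
    "action": {"running", "walking", "flying", "jumping"},
    "location": {"left", "right", "on", "in", "near", "behind"},
    "appearance": {"red", "white", "black", "small", "large"},
}


def decompose(description: str) -> dict[str, list[str]]:
    """Assign each comma-separated phrase to the first attribute whose
    keyword it contains; unmatched phrases fall back to 'category'."""
    groups: dict[str, list[str]] = {name: [] for name in ATTRIBUTE_KEYWORDS}
    groups["category"] = []
    for phrase in (part.strip() for part in description.split(",")):
        if not phrase:
            continue
        words = set(phrase.lower().split())
        for name, keywords in ATTRIBUTE_KEYWORDS.items():
            if words & keywords:
                groups[name].append(phrase)
                break
        else:
            # No keyword matched, so treat the phrase as the object category.
            groups["category"].append(phrase)
    return groups


if __name__ == "__main__":
    print(decompose("a dog, white fur, running fast, on the grass"))
```

In this sketch, phrases such as an action or a location would change over the course of a video, while the category would not, which is the kind of temporal distinction the abstract refers to.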

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 2·Confidence 5

Strengths

1. The paper is well structured, and the problem of temporal and spatial misalignment between visual and language modalities is clearly described and intuitively motivated.
2. The method is evaluated on three major visual-language tracking benchmarks (TNL2K, LaSOT, OTB-lang) and shows consistent results.

Weaknesses

1. Insufficient comparison with state-of-the-art methods. Although the paper includes comparisons with several previous works (e.g., CiteTracker, QueryNLT, DUTrack), it omits stronger and more recent SOTA baselines, especially those leveraging multimodal large language models (e.g., ChatTracker[1], and ATCTrack[2]).
2. Insufficient evaluation datasets. More vision-language tracking datasets should be added to evaluate the tracker's performance, including MGIT and LaSOText.
3. Limited novelty. T

Reviewer 02·Rating 4·Confidence 4

Strengths

The paper introduces an original perspective on vision-language tracking through attribute-specific spatio-temporal alignment, showing a clear methodological novelty. The approach is well-designed, combining attribute decomposition with cross-frame priors for robust feature fusion. Experiments are comprehensive and results are consistent across benchmarks, demonstrating strong technical quality. The paper is clearly written, with well-defined modules and intuitive structure. Overall, it contribu

Weaknesses

1) The experiments focus mainly on short-text benchmarks such as TNL2K and LaSOT, which do not fully capture the advantages of the proposed framework under long-term, evolving, or distractive scenarios. It is recommended to include MGIT experiments to verify the effectiveness of the cross-frame semantic and spatio-temporal alignment mechanisms.
2) The attribute decomposition relies on external rules or LLM-based parsing rather than a learnable design, which may affect reproducibility and cross-

Reviewer 03·Rating 2·Confidence 4

Strengths

The methodology is described in detail, and the accompanying figures clearly and accurately convey the authors' intent. The model design possesses notable interpretability: by leveraging a Large Language Model (LLM) to segment attributes within the linguistic information, it facilitates the analysis of the roles played by different attributes in the tracking task. The experimental evaluation is comprehensive, encompassing experiments conducted with both HiViT and ViT serving as the backbone ar

Weaknesses

**Lack of Novelty**: The issue of spatio-temporal scale mismatch has already been noted and partially addressed by multiple preceding studies in visual-language tracking. The paper fails to discuss the distinctions between its approach and these existing works. Furthermore, the proposed method appears to be primarily a synthesis of existing literature with minor modifications [1][2]. **Prohibitive Overhead**: The utilization of a Large Language Model (LLM) for natural language processing incurs

Reviewer 04·Rating 4·Confidence 5

Strengths

1. The issue addressed in this paper, namely the misalignment between text prompts and dynamic visual targets, is a core problem in the visual-language tracking task.
2. This paper includes many illustrative figures, which facilitate the reader’s quick understanding.

Weaknesses

1. The motivation of this paper, which is to align textual cues with dynamic video features, shares many similarities with a recent visual-language tracker, ATCTrack. However, this paper lacks discussion and comparison with ATCTrack (ICCV 2025).
2. A key design of this paper is to divide the text prompt into four parts. What is the basis for this division? Given the flexible and diverse forms of text descriptions, how can this division ensure coverage of all types of texts?
3. The baseline mo
