Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding

Xin Gu; Yaojie Shen; Chenxi Luo; Tiejian Luo; Yan Huang; Yuewei Lin; Heng Fan; Libo Zhang

arXiv:2502.11168 · cs.CV · February 18, 2025

TL;DR

This paper introduces a Target-Aware Transformer for spatio-temporal video grounding that generates target-specific object queries from video-text pairs, leading to improved localization accuracy in complex scenarios.

Contribution

The paper proposes a novel method to generate adaptive, target-specific object queries using text-guided temporal sampling and attribute-aware spatial activation, enhancing STVG performance.

Findings

01. Achieves state-of-the-art results on three benchmarks.
02. Significantly outperforms baseline methods.
03. Demonstrates robustness in complex scenarios with distractors or occlusion.

Abstract

The Transformer has attracted increasing interest in spatio-temporal video grounding (STVG), owing to its end-to-end pipeline and promising results. Existing Transformer-based STVG approaches often leverage a set of object queries for spatial and temporal localization; these queries are initialized simply with zeros and then gradually learn target position information through iterative interactions with multimodal features. Despite their simplicity, these zero object queries lack target-specific cues and therefore struggle to learn discriminative target information from interactions with multimodal features in complicated scenarios (e.g., with distractors or occlusion), resulting in degraded performance. To address this, we introduce a novel Target-Aware Transformer for STVG (TA-STVG), which adaptively generates object queries by exploring target-specific cues from the given video-text pair to improve STVG. The key lies in two simple yet…
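To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It compares zero-initialized object queries with queries initialized from text-relevant frame features. The function names (`zero_queries`, `text_guided_queries`), the dot-product relevance score, and the top-k pooling are all assumptions chosen for illustration; the actual TA-STVG modules are described in the paper and the official repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def zero_queries(num_frames, dim):
    """Baseline from the abstract: one all-zero object query per frame."""
    return [[0.0] * dim for _ in range(num_frames)]


def text_guided_queries(frame_feats, text_feat, top_k):
    """Hypothetical stand-in for target-aware initialization.

    Scores each frame by its similarity to the text feature, keeps the
    top_k most relevant frames, and pools them into one shared query
    that is copied to every frame. This is an illustrative assumption,
    not the procedure used in TA-STVG.
    """
    relevance = softmax([dot(f, text_feat) for f in frame_feats])
    # Indices of the top_k frames most relevant to the text.
    top = sorted(range(len(frame_feats)),
                 key=lambda i: relevance[i], reverse=True)[:top_k]
    norm = sum(relevance[i] for i in top)
    dim = len(text_feat)
    # Relevance-weighted average of the selected frame features.
    pooled = [sum(relevance[i] / norm * frame_feats[i][d] for i in top)
              for d in range(dim)]
    return [list(pooled) for _ in frame_feats]


# Toy data: three frames with 2-D features and one 2-D text feature.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
text = [1.0, 0.0]
baseline = zero_queries(len(frames), len(text))
guided = text_guided_queries(frames, text, top_k=2)
```

In this toy setup the baseline queries carry no information about the described target, while the guided queries start near the features of the frames that best match the text, which is the intuition the abstract gives for target-aware initialization.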

Code & Models

Repositories

HengLan/TA-STVG (PyTorch, official)

Videos

Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding · SlidesLive

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging · Advanced Image Processing Techniques

Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Residual Connection · Linear Layer · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax