Tracking Objects and Activities with Attention for Temporal Sentence Grounding

Zeyu Xiong; Daizong Liu; Pan Zhou; Jiahao Zhu

arXiv:2302.10813·cs.CV·February 22, 2023

TL;DR

This paper introduces TSTNet, a novel approach for temporal sentence grounding that tracks objects and activities to better capture fine-grained spatio-temporal behaviors, outperforming existing methods on benchmark datasets.

Contribution

The paper proposes a new tracking-based framework for TSG, incorporating a cross-modal target generator and a temporal sentence tracker for improved localization accuracy.

Findings

01. Achieves state-of-the-art performance on the Charades-STA and TACoS datasets.

02. Runs in real time with high accuracy.

03. Effectively models subtle spatio-temporal differences.

Abstract

Temporal sentence grounding (TSG) aims to localize the temporal segment which is semantically aligned with a natural language query in an untrimmed video. Most existing methods extract frame-grained features or object-grained features by 3D ConvNet or detection network under a conventional TSG framework, failing to capture the subtle differences between frames or to model the spatio-temporal behavior of core persons/objects. In this paper, we introduce a new perspective to address the TSG task by tracking pivotal objects and activities to learn more fine-grained spatio-temporal behaviors. Specifically, we propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator to generate multi-modal templates and search space, filtering objects and activities, and (B) a Temporal Sentence Tracker to track multi-modal targets for modeling the…
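To make the task definition concrete: the output of TSG is a (start, end) segment in the video, and a predicted segment is usually judged by how much it overlaps the annotated one. The sketch below computes temporal intersection-over-union for two segments. The metric is not stated in the abstract excerpt above, and the names `Segment` and `temporal_iou` are hypothetical; this is an illustration of the task's input/output shape, not the authors' evaluation code.

```python
from typing import Tuple

# A temporal segment as (start_seconds, end_seconds); hypothetical type alias.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Overlap ratio between a predicted and an annotated segment.

    Assumes start <= end for both segments. Returns 0.0 when the
    segments do not overlap or when both have zero length.
    """
    # Length of the shared interval, clamped at zero for disjoint segments.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union length = sum of both lengths minus the shared part.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Illustrative values only, not taken from any dataset.
    predicted = (0.0, 10.0)
    annotated = (5.0, 15.0)
    print(temporal_iou(predicted, annotated))
```

A prediction would typically be counted as correct when this ratio exceeds a chosen threshold; the threshold itself is a reporting choice and is not given in the text above.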

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Subtitles and Audiovisual Media