Tracking Objects and Activities with Attention for Temporal Sentence Grounding
Zeyu Xiong, Daizong Liu, Pan Zhou, Jiahao Zhu

TL;DR
This paper introduces TSTNet, a novel approach for temporal sentence grounding that tracks objects and activities to better capture fine-grained spatio-temporal behaviors, outperforming existing methods on benchmark datasets.
Contribution
The paper proposes a new tracking-based framework for temporal sentence grounding (TSG), incorporating a cross-modal target generator and a temporal sentence tracker for improved localization accuracy.
Findings
Achieves state-of-the-art performance on Charades-STA and TACoS datasets.
Runs in real time with high accuracy.
Effectively models subtle spatio-temporal differences.
Abstract
Temporal sentence grounding (TSG) aims to localize the temporal segment that is semantically aligned with a natural language query in an untrimmed video. Most existing methods extract frame-grained or object-grained features with a 3D ConvNet or a detection network under a conventional TSG framework, failing to capture the subtle differences between frames or to model the spatio-temporal behavior of core persons/objects. In this paper, we introduce a new perspective to address the TSG task by tracking pivotal objects and activities to learn more fine-grained spatio-temporal behaviors. Specifically, we propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator to generate multi-modal templates and search space, filtering objects and activities, and (B) a Temporal Sentence Tracker to track multi-modal targets for modeling the…
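To make the task's output concrete: a TSG model returns a (start, end) segment for a query, and such predictions are commonly scored by temporal intersection-over-union (IoU) against the annotated segment. The sketch below is illustrative only. This summary does not state which metric the paper reports, and the function name, the example query, and the timestamps are all hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments, in seconds."""
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical query: "person opens the door"
predicted = (4.0, 10.0)   # segment returned by some grounding model (assumed)
annotated = (6.0, 12.0)   # ground-truth segment (assumed)
score = temporal_iou(predicted, annotated)
```

Under this measure, a prediction typically counts as correct when its IoU with the annotation meets a chosen threshold.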
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Subtitles and Audiovisual Media
