TALL: Temporal Activity Localization via Language Query
Jiyang Gao, Chen Sun, Zhenheng Yang, Ram Nevatia

TL;DR
This paper introduces TALL, the task of localizing activities in untrimmed videos from natural language queries. It addresses the limitations of fixed activity classifiers by enabling flexible, query-based localization.
Contribution
It proposes a novel Cross-modal Temporal Regression Localizer (CTRL) that jointly models the text query and video clips, producing alignment scores and regressed action boundaries for each candidate clip.
Findings
CTRL outperforms previous methods on the TaCoS and Charades-STA datasets.
The approach effectively aligns language queries with video segments.
A new dataset with sentence annotations, Charades-STA, was created for evaluation.
Abstract
This paper focuses on temporal localization of actions in untrimmed videos. Existing methods typically train classifiers for a pre-defined list of actions and apply them in a sliding window fashion. However, activities in the wild consist of a wide combination of actors, actions and objects; it is difficult to design a proper activity list that meets users' needs. We propose to localize activities by natural language queries. Temporal Activity Localization via Language (TALL) is challenging as it requires: (1) suitable design of text and video representations to allow cross-modal matching of actions and language queries; (2) ability to locate actions accurately given features from sliding windows of limited granularity. We propose a novel Cross-modal Temporal Regression Localizer (CTRL) to jointly model text query and video clips, output alignment scores and action boundary regression…
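The pipeline the abstract describes (sliding-window candidates, an alignment score per candidate, and boundary regression to correct the coarse window edges) can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `sliding_windows`, `localize`, and `toy_model` are illustrative helpers, not the authors' code, and `toy_model` is a hand-written stand-in for the learned cross-modal network, which is assumed to return a score plus start and end offsets in seconds.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Prediction:
    start: float   # refined start time in seconds
    end: float     # refined end time in seconds
    score: float   # alignment score between the query and the clip


def sliding_windows(duration: float, window_lengths: List[float],
                    overlap: float = 0.5) -> List[Tuple[float, float]]:
    """Enumerate candidate clips of several lengths over a video."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    windows = []
    for length in window_lengths:
        stride = length * (1.0 - overlap)
        start = 0.0
        while start + length <= duration:
            windows.append((start, start + length))
            start += stride
    return windows


def localize(duration: float, window_lengths: List[float],
             score_and_offsets: Callable[[float, float], Tuple[float, float, float]],
             overlap: float = 0.5) -> List[Prediction]:
    """Score every window, shift its boundaries by the regressed offsets,
    and return predictions sorted by alignment score (best first)."""
    predictions = []
    for start, end in sliding_windows(duration, window_lengths, overlap):
        score, d_start, d_end = score_and_offsets(start, end)
        refined_start = max(0.0, start + d_start)
        refined_end = min(duration, end + d_end)
        predictions.append(Prediction(refined_start, refined_end, score))
    return sorted(predictions, key=lambda p: p.score, reverse=True)


def toy_model(start: float, end: float) -> Tuple[float, float, float]:
    """Hypothetical stand-in for the learned model: pretends the queried
    activity spans 3.0-7.0 s and returns (overlap-based score, offsets)."""
    target_start, target_end = 3.0, 7.0
    inter = max(0.0, min(end, target_end) - max(start, target_start))
    union = max(end, target_end) - min(start, target_start)
    return inter / union, target_start - start, target_end - end


if __name__ == "__main__":
    best = localize(10.0, [4.0], toy_model)[0]
    print(best)
```

The point of the sketch is the second requirement listed in the abstract: windows only land on a coarse grid, so the top-scoring window is shifted by the regressed offsets rather than returned as-is.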
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
