TL;DR
This paper introduces a new dataset of narrated actions in lifestyle vlogs and a method that localizes those actions in time from their expected duration, improving on previous temporal action localization work.
Contribution
The paper presents a novel dataset of manual temporal localization annotations for 13,000 narrated actions in 1,200 vlog clips, and a simple duration-based approach that improves temporal action localization performance.
Findings
The proposed method improves temporal action localization over previous work.
Analysis of the annotated data shows how the language and visual modalities interact throughout the videos.
Duration information provides complementary cues for action localization.
Abstract
We consider the task of temporal human action localization in lifestyle vlogs. We introduce a novel dataset consisting of manual annotations of temporal localization for 13,000 narrated actions in 1,200 video clips. We present an extensive analysis of this data, which allows us to better understand how the language and visual modalities interact throughout the videos. We propose a simple yet effective method to localize the narrated actions based on their expected duration. Through several experiments and analyses, we show that our method brings complementary information with respect to previous methods, and leads to improvements over previous work for the task of temporal action localization.
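The abstract does not spell out how expected duration is turned into a prediction, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical baseline in which the predicted interval starts at the narration time and extends for the action's expected duration, and it scores that prediction against a manual annotation with temporal intersection-over-union, a common metric for this task. The function names, the start-at-narration rule, and the example numbers are all illustrative assumptions.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def duration_window(narration_time: float,
                    expected_duration: float,
                    clip_length: float) -> Interval:
    """Hypothetical baseline (an assumption, not the paper's method):
    start the action at the narration time and extend it by the
    expected duration, clipped to the bounds of the video clip."""
    start = max(0.0, min(narration_time, clip_length))
    end = min(clip_length, start + expected_duration)
    return (start, end)


def temporal_iou(pred: Interval, gold: Interval) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    intersection = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Illustrative numbers only: an action narrated at 5 s, expected to
    # last 10 s, in a 60 s clip, compared with a made-up annotation.
    predicted = duration_window(5.0, 10.0, 60.0)
    annotated = (10.0, 20.0)
    print(predicted, temporal_iou(predicted, annotated))
```

Under these assumptions, an action with a longer expected duration yields a wider predicted window, which is one way duration can supply a cue that is complementary to visual or textual evidence.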
