TL;DR
This paper introduces a novel framework for video understanding using visual semantic role labeling, along with a large-scale annotated dataset called VidSitu, enabling detailed analysis of events and entities in movies.
Contribution
The paper presents the VidSitu benchmark dataset and a new framework for semantic role labeling in videos, advancing the understanding of complex, diverse movie clips.
Findings
VidSitu contains 29K annotated 10-second clips from movies.
Standard models show room for improvement on semantic role labeling in videos.
Comprehensive analysis highlights challenges and opportunities in video event understanding.
Abstract
We propose a new framework for understanding and representing related salient events in a video using visual semantic role labeling. We represent videos as a set of related events, wherein each event consists of a verb and multiple entities that fulfill various roles relevant to that event. To study the challenging task of semantic role labeling in videos or VidSRL, we introduce the VidSitu benchmark, a large-scale video understanding data source with 29K 10-second movie clips richly annotated with a verb and semantic-roles every 2 seconds. Entities are co-referenced across events within a movie clip and events are connected to each other via event-event relations. Clips in VidSitu are drawn from a large collection of movies (∼3K) and have been chosen to be both complex (∼4.2 unique verbs within a video) as well as diverse (∼200 verbs have more than …
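The representation described in the abstract (a clip as a set of events, each with a verb and role-filling entities, plus co-reference and event-event relations) can be sketched as a small data structure. This is an illustrative sketch only, not the dataset's actual annotation schema: the class names, the role names (`Arg0`, `Arg1`), the relation label, and the example clip are all assumptions made here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One annotated segment: a verb plus the entities filling its roles."""
    start_sec: int
    end_sec: int
    verb: str
    roles: dict[str, str]  # role name -> entity description (hypothetical role names)


@dataclass
class EventRelation:
    """A directed relation between two events of the same clip."""
    source: int  # index into Clip.events
    target: int  # index into Clip.events
    label: str   # hypothetical label, e.g. "causes"


@dataclass
class Clip:
    events: list[Event] = field(default_factory=list)
    relations: list[EventRelation] = field(default_factory=list)

    def unique_verbs(self) -> set[str]:
        """Distinct verbs in the clip (a rough measure of its complexity)."""
        return {event.verb for event in self.events}

    def coreferent_entities(self) -> dict[str, list[int]]:
        """Map each entity to the indices of the events it takes part in."""
        mentions: dict[str, list[int]] = {}
        for index, event in enumerate(self.events):
            for entity in event.roles.values():
                indices = mentions.setdefault(entity, [])
                if index not in indices:
                    indices.append(index)
        return mentions


# Made-up example clip with two 2-second events sharing one entity.
clip = Clip(
    events=[
        Event(0, 2, "chase", {"Arg0": "man in red jacket", "Arg1": "dog"}),
        Event(2, 4, "fall", {"Arg1": "man in red jacket"}),
    ],
    relations=[EventRelation(source=0, target=1, label="causes")],
)

print(clip.unique_verbs())
print(clip.coreferent_entities())
```

Keying co-reference on a shared entity string keeps the sketch short; a real annotation format would more likely use explicit entity identifiers so that differently worded mentions can still be linked.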
