Hierarchical Self-supervised Representation Learning for Movie Understanding
Fanyi Xiao, Kaustav Kundu, Joseph Tighe, Davide Modolo

TL;DR
This paper introduces a hierarchical self-supervised pretraining method for movie understanding that uses a different pretraining task at each level of the model. It improves results on the VidSitu benchmark and shows that contextualized event features are effective on LVU tasks.
Contribution
It proposes a novel hierarchical pretraining strategy with a separate objective for each level of the model, targeting movie understanding rather than action recognition.
Findings
Improved performance on all tasks and metrics of the VidSitu benchmark (e.g., semantic role prediction CIDEr from 47% to 61%)
Effective use of contextualized event features on LVU tasks
Demonstrated complementarity of event and instance features
Abstract
Most self-supervised video representation learning approaches focus on action recognition. In contrast, in this paper we focus on self-supervised video learning for movie understanding and propose a novel hierarchical self-supervised pretraining strategy that separately pretrains each level of our hierarchical movie understanding model (based on [37]). Specifically, we propose to pretrain the low-level video backbone using a contrastive learning objective, and to pretrain the higher-level video contextualizer using an event mask prediction task, which enables the use of different data sources for pretraining different levels of the hierarchy. We first show that our self-supervised pretraining strategies are effective and lead to improved performance on all tasks and metrics on the VidSitu benchmark [37] (e.g., improving semantic role prediction from 47% to 61% CIDEr score). We further…
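The abstract names two pretraining objectives but gives no formulas, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes an InfoNCE-style loss for the contrastive objective and a mean-squared-error loss for event mask prediction; both choices, along with every function name and the `temperature` parameter, are illustrative assumptions. The `mean_context_predictor` function is a hypothetical placeholder for the video contextualizer.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a tiny epsilon avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in v)) + 1e-12
    return [x / norm for x in v]


def info_nce_loss(anchors, positives, temperature=0.1):
    """Assumed contrastive loss: anchors[i] should match positives[i] only."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching pair (index i) as the correct class.
        total += log_denominator - logits[i]
    return total / len(a)


def mask_events(events, mask_index):
    """Replace one event feature with zeros; return (masked copy, target)."""
    target = list(events[mask_index])
    masked = [list(e) for e in events]
    masked[mask_index] = [0.0] * len(target)
    return masked, target


def mean_context_predictor(masked, mask_index):
    """Hypothetical stand-in for a contextualizer: average the unmasked events."""
    others = [e for i, e in enumerate(masked) if i != mask_index]
    dim = len(others[0])
    return [sum(e[d] for e in others) / len(others) for d in range(dim)]


def masked_prediction_loss(prediction, target):
    """Assumed reconstruction loss: mean squared error on the masked event."""
    return sum((x - y) ** 2 for x, y in zip(prediction, target)) / len(target)


# Contrastive objective: aligned pairs should score a lower loss than swapped pairs.
clips = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(clips, clips)
swapped_loss = info_nce_loss(clips, [clips[1], clips[0]])

# Event mask prediction: hide the middle event and predict it from its neighbors.
events = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
masked, target = mask_events(events, mask_index=1)
prediction = mean_context_predictor(masked, mask_index=1)
mask_loss = masked_prediction_loss(prediction, target)
```

Because the two losses share no inputs, each level can be trained on its own data source, which is the property the abstract highlights. In a real system the placeholder predictor would be replaced by a learned sequence model over event features.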
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications
Methods: Contrastive Learning
