Scalable Temporal Localization of Sensitive Activities in Movies and TV Episodes
Xiang Hao, Jingxiang Chen, Shixing Chen, Ahmed Saad, Raffay Hamid

TL;DR
This paper introduces a scalable hierarchical model that uses weak video-level labels and sparse clip-level labels to localize sensitive activities in long-form videos, reporting a 107.2% relative mAP improvement over prior state-of-the-art methods.
Contribution
The novel Coarse2Fine network effectively combines weak video-level labels with sparse clip-level labels for improved activity localization.
Findings
107.2% relative mAP improvement over state-of-the-art
Largest-scale empirical analysis with 41,234 videos
Effective handling of rare, sensitive content in long videos
Abstract
To help customers make better-informed viewing choices, video-streaming services try to moderate their content and provide more visibility into which portions of their movies and TV episodes contain age-appropriate material (e.g., nudity, sex, violence, or drug-use). Supervised models to localize these sensitive activities require large amounts of clip-level labeled data which is hard to obtain, while weakly-supervised models to this end usually do not offer competitive accuracy. To address this challenge, we propose a novel Coarse2Fine network designed to make use of readily obtainable video-level weak labels in conjunction with sparse clip-level labels of age-appropriate activities. Our model aggregates frame-level predictions to make video-level classifications and is therefore able to leverage sparse clip-level labels along with video-level labels. Furthermore, by performing…
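The abstract's point that frame-level predictions are aggregated into a video-level classification, so that both label types can supervise one model, can be illustrated with a small sketch. This is not the paper's implementation: the max-pooling aggregation, the binary cross-entropy losses, the `clip_weight` parameter, and all function names are assumptions chosen for illustration, and clip-level labels are simplified here to labels on individual frame indices.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def video_score(frame_probs):
    """Aggregate frame-level probabilities into one video-level probability.
    Max pooling is an assumption made for this sketch."""
    return max(frame_probs)

def combined_loss(frame_probs, video_label, clip_labels, clip_weight=1.0):
    """Weak video-level loss plus a loss on the sparsely labeled frames.

    clip_labels maps a frame index to 0/1 for the few frames that carry a label;
    unlabeled frames contribute only through the video-level term.
    """
    weak = bce(video_score(frame_probs), video_label)
    if not clip_labels:
        return weak
    sparse = sum(bce(frame_probs[i], y) for i, y in clip_labels.items())
    sparse /= len(clip_labels)
    return weak + clip_weight * sparse

# Hypothetical per-frame probabilities for one video tagged as sensitive.
frame_probs = [0.05, 0.10, 0.90, 0.20]
loss_weak_only = combined_loss(frame_probs, 1, {})
loss_with_clips = combined_loss(frame_probs, 1, {2: 1, 0: 0})
```

In this sketch a video with only a weak label still produces a training signal through the pooled score, while any labeled frames add a direct per-frame term on top of it.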
Taxonomy
Topics: Technology Use by Older Adults · Mobile Health and mHealth Applications · Recommender Systems and Techniques
