Temporal Action Detection with Multi-level Supervision
Baifeng Shi, Qi Dai, Judy Hoffman, Kate Saenko, Trevor Darrell, Huijuan Xu

TL;DR
This paper introduces semi-supervised and omni-supervised methods for temporal action detection in videos, leveraging unlabeled and weakly-labeled data to reduce annotation costs and improve detection accuracy.
Contribution
It proposes novel SSAD and OSAD frameworks, along with UFA and IB modules, to effectively utilize different supervision levels and address common detection errors.
Findings
UFA module reduces action incompleteness errors.
IB module mitigates action-context confusion.
OSAD-IB outperforms baselines with limited annotations.
Abstract
Training temporal action detection in videos requires large amounts of labeled data, yet such annotation is expensive to collect. Incorporating unlabeled or weakly-labeled data to train action detection models could help reduce annotation cost. In this work, we first introduce the Semi-supervised Action Detection (SSAD) task with a mixture of labeled and unlabeled data and analyze different types of errors in the proposed SSAD baselines, which are directly adapted from the semi-supervised classification task. To alleviate the main error of action incompleteness (i.e., missing parts of actions) in SSAD baselines, we further design an unsupervised foreground attention (UFA) module utilizing the "independence" between foreground and background motion. Then we incorporate weakly-labeled data into SSAD and propose Omni-supervised Action Detection (OSAD) with three levels of supervision. An…
