Temporal Action Detection with Multi-level Supervision

Baifeng Shi; Qi Dai; Judy Hoffman; Kate Saenko; Trevor Darrell; Huijuan Xu

arXiv:2011.11893·cs.CV·February 19, 2021

TL;DR

This paper introduces semi-supervised and omni-supervised methods for temporal action detection in videos, leveraging unlabeled and weakly-labeled data to reduce annotation costs and improve detection accuracy.

Contribution

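The paper proposes the Semi-supervised Action Detection (SSAD) and Omni-supervised Action Detection (OSAD) frameworks, together with the unsupervised foreground attention (UFA) module and the IB module, to exploit different levels of supervision and address common detection errors.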

Findings

01. The UFA module reduces action incompleteness errors.

02. The IB module mitigates action-context confusion.

03. OSAD-IB outperforms baselines with limited annotations.

Abstract

Training temporal action detection models on videos requires large amounts of labeled data, yet such annotation is expensive to collect. Incorporating unlabeled or weakly-labeled data to train action detection models could help reduce annotation cost. In this work, we first introduce the Semi-supervised Action Detection (SSAD) task with a mixture of labeled and unlabeled data, and analyze different types of errors in the proposed SSAD baselines, which are directly adapted from the semi-supervised classification task. To alleviate the main error of action incompleteness (i.e., missing parts of actions) in SSAD baselines, we further design an unsupervised foreground attention (UFA) module utilizing the "independence" between foreground and background motion. Then we incorporate weakly-labeled data into SSAD and propose Omni-supervised Action Detection (OSAD) with three levels of supervision. An…
