Connectionist Temporal Modeling for Weakly Supervised Action Labeling
De-An Huang, Li Fei-Fei, Juan Carlos Niebles

TL;DR
This paper introduces Extended Connectionist Temporal Classification (ECTC), a weakly supervised framework for action labeling in videos. It uses only the order of actions, together with frame-to-frame visual similarity, to align frames with actions without per-frame annotations.
Contribution
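The paper presents the Extended Connectionist Temporal Classification (ECTC) framework, which efficiently evaluates all possible frame-to-action alignments via dynamic programming and explicitly enforces their consistency with frame-to-frame visual similarities. It also extends the framework to the semi-supervised case, where a few frames in a video are sparsely annotated.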
Findings
With less than 1% of frames labeled per video, ECTC outperforms existing semi-supervised approaches.
The framework achieves performance comparable to fully supervised approaches.
Explicitly enforcing consistency between alignments and frame-to-frame visual similarity improves weakly supervised action labeling.
Abstract
We propose a weakly-supervised framework for action labeling in video, where only the order of occurring actions is required during training time. The key challenge is that the per-frame alignments between the input (video) and label (action) sequences are unknown during training. We address this by introducing the Extended Connectionist Temporal Classification (ECTC) framework to efficiently evaluate all possible alignments via dynamic programming and explicitly enforce their consistency with frame-to-frame visual similarities. This protects the model from distractions of visually inconsistent or degenerated alignments without the need for temporal supervision. We further extend our framework to the semi-supervised case when a few frames are sparsely annotated in a video. With less than 1% of labeled frames per video, our method is able to outperform existing semi-supervised approaches…
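To illustrate what "evaluate all possible alignments via dynamic programming" means, here is a minimal sketch of the forward pass of standard Connectionist Temporal Classification (CTC), the method that ECTC extends. It is not the paper's ECTC: the visual-similarity weighting is not reproduced, and the use of a blank symbol at index 0, the function names, and the toy probabilities are all assumptions made for illustration. A brute-force enumeration is included only to show that the dynamic program gives the same sum on a small input.

```python
from itertools import product

BLANK = 0  # assumed index of the blank symbol (standard CTC convention)


def collapse(path, blank=BLANK):
    """Map a frame-level path to a label sequence: merge repeats, drop blanks."""
    out, prev = [], None
    for k in path:
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return out


def ctc_sequence_probability(frame_probs, labels, blank=BLANK):
    """Sum the probability of every frame-level alignment that collapses to `labels`.

    frame_probs[t][k] is the per-frame probability of class k at frame t.
    labels is the ordered list of action indices (the weak supervision).
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    extended = [blank]
    for label in labels:
        extended += [label, blank]
    num_frames, num_states = len(frame_probs), len(extended)

    # alpha[s]: total probability of partial alignments ending in extended[s]
    alpha = [0.0] * num_states
    alpha[0] = frame_probs[0][extended[0]]
    if num_states > 1:
        alpha[1] = frame_probs[0][extended[1]]

    for t in range(1, num_frames):
        prev, alpha = alpha, [0.0] * num_states
        for s in range(num_states):
            total = prev[s]                      # stay in the same state
            if s >= 1:
                total += prev[s - 1]             # advance by one state
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += prev[s - 2]             # skip the blank between two different labels
            alpha[s] = total * frame_probs[t][extended[s]]

    # Valid alignments end in the last label or the trailing blank
    return alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)


def brute_force_probability(frame_probs, labels, blank=BLANK):
    """Enumerate every path explicitly; exponential cost, for checking only."""
    num_classes = len(frame_probs[0])
    total = 0.0
    for path in product(range(num_classes), repeat=len(frame_probs)):
        if collapse(path, blank) == list(labels):
            p = 1.0
            for t, k in enumerate(path):
                p *= frame_probs[t][k]
            total += p
    return total


# Toy example: 4 frames, classes {0: blank, 1: action A, 2: action B}
toy_probs = [
    [0.1, 0.7, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.2, 0.5],
    [0.1, 0.1, 0.8],
]
dp_value = ctc_sequence_probability(toy_probs, [1, 2])
brute_value = brute_force_probability(toy_probs, [1, 2])
```

The dynamic program visits each (frame, state) pair once, whereas the enumeration grows exponentially with the number of frames. In training, the negative log of this sum would typically serve as the loss given only the action order.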
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
