Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos
Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, Li Fei-Fei

TL;DR
This paper introduces MultiTHUMOS, a dataset of dense, multi-label action annotations over unconstrained internet videos, and proposes a novel LSTM variant that improves action labeling accuracy and enables tasks such as structured retrieval and action prediction.
Contribution
The paper extends the THUMOS dataset with dense labels and defines a new LSTM variant that models temporal relations within and across action classes via multiple input and output connections.
Findings
The proposed model improves action labeling accuracy on densely labeled video.
The LSTM variant models temporal relations within and across action classes.
The model also enables deeper understanding tasks, ranging from structured retrieval to action prediction.
Abstract
Every moment counts in action recognition. A comprehensive understanding of human activity in video requires labeling every frame according to the actions occurring, placing multiple labels densely over a video sequence. To study this problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new dataset of dense labels over unconstrained internet videos. Modeling multiple, dense labels benefits from temporal relations within and across classes. We define a novel variant of long short-term memory (LSTM) deep networks for modeling these temporal relations via multiple input and output connections. We show that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.
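To make "dense labels" concrete, here is a minimal sketch of how interval annotations could be expanded into a per-frame, multi-label target. It assumes annotations arrive as (label, start, end) tuples in seconds; that format, the function name `dense_labels`, and the example class names are hypothetical and are not taken from the MultiTHUMOS release.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical annotation format: (class name, start time, end time) in seconds.
Annotation = Tuple[str, float, float]


def dense_labels(
    annotations: Sequence[Annotation],
    classes: Sequence[str],
    duration: float,
    fps: float,
) -> List[List[int]]:
    """Expand interval annotations into a frames x classes multi-hot matrix."""
    num_frames = int(round(duration * fps))
    index = {name: i for i, name in enumerate(classes)}
    labels = [[0] * len(classes) for _ in range(num_frames)]
    for name, start, end in annotations:
        # Clamp the interval to the video and convert seconds to frame indices.
        first = max(0, int(start * fps))
        last = min(num_frames, int(math.ceil(end * fps)))
        for t in range(first, last):
            labels[t][index[name]] = 1
    return labels


# Two overlapping actions in a 3-second clip sampled at 2 frames per second.
example = dense_labels(
    annotations=[("run", 0.0, 2.0), ("jump", 1.0, 1.5)],
    classes=["run", "jump", "throw"],
    duration=3.0,
    fps=2.0,
)
```

In this toy clip, the frame covering 1.0 to 1.5 seconds carries both "run" and "jump", which is the multi-label, every-frame setting the abstract describes. A model for this task would emit one score per class at every frame rather than a single label per video.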
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications
