You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization
Okan Köpüklü, Xiangyu Wei, Gerhard Rigoll

TL;DR
YOWO introduces a fast, unified CNN architecture for real-time spatiotemporal action localization in videos, combining spatial and temporal feature extraction in a single stage, achieving state-of-the-art accuracy and high processing speed.
Contribution
The paper presents YOWO, the first single-stage architecture that efficiently combines spatial and temporal information for real-time action localization in videos.
Findings
YOWO runs at 34 fps on 16-frame input clips, which the authors report as the fastest among the methods compared at the time of publication.
YOWO outperforms previous methods on J-HMDB-21 and UCF101-24 datasets.
YOWO provides competitive results on the AVA dataset.
Abstract
Spatiotemporal action localization requires the incorporation of two sources of information into the designed architecture: (1) temporal information from the previous frames and (2) spatial information from the key frame. Current state-of-the-art approaches usually extract this information with separate networks and use an extra fusion mechanism to obtain detections. In this work, we present YOWO, a unified CNN architecture for real-time spatiotemporal action localization in video streams. YOWO is a single-stage architecture with two branches that extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in one evaluation. Since the whole architecture is unified, it can be optimized end-to-end. The YOWO architecture is fast, providing 34 frames-per-second on 16-frame input clips and 62 frames-per-second on…
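The two-branch, single-pass idea described in the abstract can be illustrated with a toy sketch. This is not the authors' network: the function names (`temporal_branch`, `spatial_branch`, `fuse`, `head`, `forward`), the single-channel frames, the hand-picked weights, and the choice of the last frame as the key frame are all assumptions made for illustration. Simple arithmetic stands in for the CNN branches, and bounding-box regression is omitted, so the head only returns per-cell action probabilities.

```python
import math

def temporal_branch(clip):
    """Stand-in for a temporal feature extractor (assumption, not a 3D CNN):
    mean absolute frame-to-frame difference at each pixel."""
    height, width = len(clip[0]), len(clip[0][0])
    steps = max(len(clip) - 1, 1)  # avoid division by zero for 1-frame clips
    out = [[0.0] * width for _ in range(height)]
    for t in range(1, len(clip)):
        for y in range(height):
            for x in range(width):
                out[y][x] += abs(clip[t][y][x] - clip[t - 1][y][x]) / steps
    return out

def spatial_branch(key_frame):
    """Stand-in for a spatial feature extractor (assumption, not a 2D CNN):
    returns a copy of the key frame."""
    return [row[:] for row in key_frame]

def fuse(temporal, spatial):
    """Channel-wise concatenation: each cell becomes [temporal, spatial]."""
    return [[[t, s] for t, s in zip(t_row, s_row)]
            for t_row, s_row in zip(temporal, spatial)]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head(fused, weights):
    """Per-cell linear layer followed by softmax over hypothetical classes."""
    return [[softmax([sum(w * f for w, f in zip(class_w, cell))
                      for class_w in weights])
             for cell in row]
            for row in fused]

def forward(clip, weights):
    """One evaluation: both branches run on the same clip, then fuse and predict."""
    key_frame = clip[-1]  # assumption: the key frame is the most recent frame
    fused = fuse(temporal_branch(clip), spatial_branch(key_frame))
    return head(fused, weights)

# Toy input: 2 frames of 2x2 pixels; only the top-right pixel changes.
clip = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.0]],
]
# Hypothetical weights: class 0 responds to motion, class 1 is a flat baseline.
weights = [[2.0, 0.0], [0.0, 0.0]]
probs = forward(clip, weights)
```

Here `probs[y][x]` holds the class probabilities for one grid cell. The cell whose pixel changed between frames assigns more probability to the motion-sensitive class than the unchanged cells do, which is the kind of cue a temporal branch is meant to contribute alongside the key-frame appearance.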
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis
