TL;DR
This paper proposes a two-stage action detection method that enhances viewpoint invariance and temporal consistency in untrimmed videos by using augmented virtual viewpoints and a multi-scale temporal encoder.
Contribution
It introduces a novel training strategy with virtual viewpoint augmentation and a view-invariant temporal encoder for improved action detection.
Findings
Significantly outperforms state-of-the-art methods on PKU-MMD and BABEL benchmarks.
Effectively models fine-grained temporal relationships across motion windows.
Enhances viewpoint invariance in action detection.
Abstract
Viewpoint invariance and temporal consistency are critical for the effective deployment of human action detection in untrimmed videos. Existing appearance-based video detection methods often struggle with limited viewpoint diversity during training, while motion-based detection approaches frequently fail to model fine-grained temporal relationships across consecutive motion windows. This paper introduces a novel two-stage action detection approach designed to improve both view invariance and global temporal coherence. In the first stage, we extract motion features from augmented virtual viewpoints, which are used only during training. In the second stage, a new view-invariant, multi-scale temporal encoder based on selective state-space sequence modelling aggregates information across viewpoints and time scales. Experiments on PKU-MMD and BABEL…
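To make the idea of training-time virtual viewpoints concrete, here is a minimal sketch. It is not the authors' pipeline. It assumes the input is a 3D skeleton sequence of shape (T, J, 3) with the y axis pointing up, and it simulates extra camera views by rotating the sequence about that vertical axis. The function names `rotate_about_vertical` and `virtual_viewpoints` and the chosen angles are hypothetical; the abstract does not specify how the paper generates its virtual viewpoints.

```python
import numpy as np


def rotate_about_vertical(joints, angle_deg):
    """Rotate a skeleton sequence of shape (T, J, 3) about the y (vertical) axis.

    Assumption: coordinates are (x, y, z) with y pointing up.
    """
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    # Standard rotation matrix about the y axis.
    rot = np.array([
        [c, 0.0, s],
        [0.0, 1.0, 0.0],
        [-s, 0.0, c],
    ])
    # Each joint is a row vector, so multiply by the transpose.
    return joints @ rot.T


def virtual_viewpoints(joints, angles_deg=(-45.0, 0.0, 45.0)):
    """Stack rotated copies of one sequence into an array of shape (V, T, J, 3)."""
    return np.stack([rotate_about_vertical(joints, a) for a in angles_deg])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sequence = rng.normal(size=(8, 25, 3))  # 8 frames, 25 joints (illustrative sizes)
    views = virtual_viewpoints(sequence)
    print(views.shape)
```

In this sketch the rotated copies would only be produced while training, so that a downstream encoder sees the same motion from several directions; at test time the single original view would be used unchanged.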
