A Better Baseline for AVA
Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman

TL;DR
This paper presents a simple yet effective spatiotemporal action localization baseline for AVA, significantly outperforming previous models by leveraging I3D features within a Faster R-CNN framework.
Contribution
The authors introduce a new baseline for AVA action localization that feeds I3D features into the Faster R-CNN detection framework; it outperforms all submissions to the AVA challenge at CVPR 2018.
Findings
Achieved 21.9% average AP on AVA v2.1 validation set.
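Outperformed the best RGB spatiotemporal model from the original AVA paper (14.5%), the publicly available ResNet101 baseline (11.3%), and all submissions to the AVA challenge at CVPR 2018.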
Demonstrated the effectiveness of I3D features for action localization.
Abstract
We introduce a simple baseline for action localization on the AVA dataset. The model builds upon the Faster R-CNN bounding box detection framework, adapted to operate on pure spatiotemporal features - in our case produced exclusively by an I3D model pretrained on Kinetics. This model obtains 21.9% average AP on the validation set of AVA v2.1, up from 14.5% for the best RGB spatiotemporal model used in the original AVA paper (which was pretrained on Kinetics and ImageNet), and up from 11.3% for the publicly available baseline using a ResNet101 image feature extractor that was pretrained on ImageNet. Our final model obtains 22.8%/21.9% mAP on the val/test sets and outperforms all submissions to the AVA challenge at CVPR 2018.
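The abstract reports results as average AP (mAP) but does not spell out how the number is computed. As a rough illustration only, the sketch below shows one common way to compute per-class average precision from ranked detections and then average it across classes. It assumes that each detection has already been marked as a true or false positive by some upstream matching step (for example an IoU test against ground-truth boxes); the function names `average_precision` and `mean_average_precision` are hypothetical, and this is not claimed to be the official AVA evaluation code.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class.

    Assumption: is_true_positive[i] was decided upstream (e.g. by IoU
    matching against ground-truth boxes); that step is not shown here.
    """
    if num_ground_truth <= 0:
        return 0.0

    # Rank detections from most to least confident.
    ranked = sorted(zip(scores, is_true_positive),
                    key=lambda pair: pair[0], reverse=True)

    recalls, precisions = [], []
    true_positives = 0
    for rank, (_, hit) in enumerate(ranked, start=1):
        if hit:
            true_positives += 1
        recalls.append(true_positives / num_ground_truth)
        precisions.append(true_positives / rank)

    # Make precision non-increasing in recall (the usual PR "envelope").
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the rectangles under the envelope.
    ap, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(per_class_ap):
    """Unweighted mean of the per-class AP values (a dict: class -> AP)."""
    if not per_class_ap:
        return 0.0
    return sum(per_class_ap.values()) / len(per_class_ap)
```

For instance, `average_precision([0.9, 0.8, 0.7], [True, False, True], 2)` ranks three detections for a class with two ground-truth instances, and `mean_average_precision` then averages such values over all evaluated classes to give a single headline figure of the kind quoted above.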
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis
Methods: Region Proposal Network · Softmax · Convolution · RoIPool · Faster R-CNN
