Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D Moment Queries Challenge
Fangzhou Mu, Sicheng Mo, Gillian Wang, Yin Li

TL;DR
This report presents a top-performing approach to egocentric video moment queries that pairs the ActionFormer backbone with strong video features from SlowFast, Omnivore and EgoVLP, ranking 2nd on the public leaderboard of the Ego4D Moment Queries Challenge 2022.
Contribution
The report combines ActionFormer, a backbone for temporal action localization, with a trio of strong video features (SlowFast, Omnivore and EgoVLP) to improve moment localization in egocentric videos.
Findings
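Ranked 2nd on the Ego4D Moment Queries Challenge 2022 public leaderboard with 21.76% average mAP on the test set
Average mAP nearly three times higher than the official baseline
42.54% Recall@1x at tIoU=0.5 on the test set, 1.41 absolute percentage points above the top-ranked solution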
Abstract
This report describes our submission to the Ego4D Moment Queries Challenge 2022. Our submission builds on ActionFormer, the state-of-the-art backbone for temporal action localization, and a trio of strong video features from SlowFast, Omnivore and EgoVLP. Our solution is ranked 2nd on the public leaderboard with 21.76% average mAP on the test set, which is nearly three times higher than the official baseline. Further, we obtain 42.54% Recall@1x at tIoU=0.5 on the test set, outperforming the top-ranked solution by a significant margin of 1.41 absolute percentage points. Our code is available at https://github.com/happyharrycn/actionformer_release.
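The Recall@1x at tIoU=0.5 figure quoted above can be made concrete with a small sketch. It assumes the usual reading of the metric: for a query category with x ground-truth segments in a video, keep the top k*x scored predictions and count the fraction of ground-truth segments matched by at least one of them at or above the tIoU threshold. The function names and the (start, end, score) tuple layout are hypothetical and chosen for illustration; this is not the challenge's official evaluation code.

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU between two (start, end) segments, e.g. in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_kx(gt_segments, predictions, k=1, tiou_threshold=0.5):
    """Fraction of ground-truth segments matched by the top k*x predictions.

    gt_segments: list of (start, end) for one query category in one video.
    predictions: list of (start, end, score) for the same category and video.
    Assumed definition; x is the number of ground-truth segments.
    """
    x = len(gt_segments)
    if x == 0:
        return 0.0
    # Keep only the k*x highest-scoring predictions.
    top = sorted(predictions, key=lambda p: p[2], reverse=True)[: k * x]
    # A ground-truth segment counts as recalled if any kept prediction
    # overlaps it with tIoU at or above the threshold.
    hits = sum(
        any(temporal_iou(gt, (s, e)) >= tiou_threshold for s, e, _ in top)
        for gt in gt_segments
    )
    return hits / x
```

For example, with ground truth [(0.0, 10.0), (20.0, 30.0)] and predictions [(1.0, 9.0, 0.9), (50.0, 60.0, 0.8), (21.0, 29.0, 0.3)], only the two highest-scoring predictions are kept at k=1, so one of the two ground-truth segments is recalled. Raising k to 2 admits the third prediction and both segments are recalled.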
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
