ReLER@ZJU Submission to the Ego4D Moment Queries Challenge 2022

Jiayi Shao; Xiaohan Wang; Yi Yang

arXiv:2211.09558 · cs.CV · September 26, 2023 · 1 citation

TL;DR

This paper introduces a transformer-based method with segment-level recurrence for localizing activities in long egocentric videos, placing 3rd in the Ego4D Moment Queries Challenge 2022.

Contribution

It proposes a novel segment-level recurrence mechanism to improve long-term dependency modeling in transformer-based temporal action localization.

Findings

01 Achieved Recall@1, tIoU=0.5 of 37.24

02 Attained an average mAP of 17.67

03 Secured 3rd place on the challenge leaderboard
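
The recall figure above is thresholded on temporal IoU (tIoU), the overlap between a predicted segment and a ground-truth segment divided by their union. A minimal sketch of that quantity follows; the function names and the (start, end) tuple format are illustrative assumptions, and this does not reproduce the official Ego4D evaluation protocol.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def is_hit(pred, gt, threshold=0.5):
    """A prediction counts as a hit when its tIoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold
```

For example, a prediction of (2.0, 8.0) against a ground truth of (3.0, 9.0) overlaps for 5 seconds out of a 7-second union, so it would count as a hit at tIoU=0.5.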

Abstract

In this report, we present the ReLER@ZJU submission to the Ego4D Moment Queries Challenge at ECCV 2022. In this task, the goal is to retrieve and localize all instances of possible activities in egocentric videos. The Ego4D dataset is challenging for the temporal action localization task because the videos are quite long and each video contains multiple action instances with fine-grained action classes. To address these problems, we utilize a multi-scale transformer to classify different action categories and predict the boundary of each instance. Moreover, in order to better capture the long-term temporal dependencies in the long videos, we propose a segment-level recurrence mechanism. Compared with directly feeding all video features to the transformer encoder, the proposed segment-level recurrence mechanism alleviates the optimization difficulties and achieves better…
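
The abstract does not spell out how the segment-level recurrence is implemented, so the following is only a rough sketch of the general idea under stated assumptions: the video features are split into fixed-length segments, each segment is encoded together with a cached memory of previously encoded frames, and the memory is then updated. The names `segment_recurrent_encode`, `toy_encoder`, `segment_len`, and `mem_len` are hypothetical, the choice to cache encoded outputs (in the style of Transformer-XL) is an assumption, and the toy encoder merely stands in for a transformer encoder.

```python
def toy_encoder(context):
    """Stand-in for a transformer encoder (assumption): mixes each frame
    with the mean of everything visible in the context."""
    mean = sum(context) / len(context)
    return [0.5 * x + 0.5 * mean for x in context]


def segment_recurrent_encode(features, segment_len, mem_len, encode_fn):
    """Encode a long sequence segment by segment, carrying a memory forward.

    features: list of per-frame features (scalars here for simplicity).
    segment_len: number of frames encoded per step.
    mem_len: number of previously encoded frames kept as memory.
    """
    memory = []
    outputs = []
    for start in range(0, len(features), segment_len):
        segment = features[start:start + segment_len]
        # The encoder sees the cached memory followed by the current segment.
        encoded = encode_fn(memory + segment)
        # Only the positions belonging to the current segment are emitted.
        encoded_segment = encoded[len(memory):]
        outputs.extend(encoded_segment)
        # Keep the most recent mem_len encoded frames for the next segment.
        memory = (memory + encoded_segment)[-mem_len:] if mem_len > 0 else []
    return outputs
```

With `mem_len=0` each segment is encoded in isolation; a positive `mem_len` lets later segments see context from earlier ones without feeding the whole video to the encoder at once.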

Code & Models

Repositories

JonnyS1226/Ego4d_mq_3rd_solution
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization