Exploring adaptation of VideoMAE for Audio-Visual Diarization & Social @ Ego4d Looking at me Challenge
Yinan He, Guo Chen

TL;DR
This paper adapts a pretrained video masked autoencoder (VideoMAE) to egocentric audio-visual diarization and social tasks, showing that a model pretrained on third-person video transfers well and outperforms the baseline on the Ego4d Looking at me Challenge after only 10 epochs of training on egocentric data.
Contribution
The study shows that VideoMAE pretrained on third-person videos can be effectively transferred to egocentric tasks with limited training, enhancing performance in social and diarization challenges.
Findings
Transferred VideoMAE representations capture small actions effectively.
Achieved better results than the baseline after only 10 epochs of training on egocentric data.
Demonstrated successful adaptation of video pretraining for egocentric audio-visual tasks.
Abstract
In this report, we present our approach of transferring pretrained video masked autoencoders (VideoMAE) to egocentric tasks for the Ego4d Looking at me Challenge. VideoMAE is a data-efficient model for self-supervised video pre-training and transfers easily to downstream tasks. We show that the representation transferred from VideoMAE has good spatio-temporal modeling ability and can capture small actions. Starting from a VideoMAE model pretrained on ordinary videos captured from a third-person view, we only need to train on egocentric data for 10 epochs to obtain better results than the baseline on the Ego4d Looking at me Challenge.
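The transfer recipe described in the abstract (reuse a pretrained encoder, then train briefly on target-domain data) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' code: `frozen_encoder` is a hypothetical stand-in for a pretrained video model (it just averages frames), the clips are synthetic, the task is treated as binary classification, and only a small logistic head is trained while the encoder stays fixed, whereas the report may well update the whole network. Only the 10-epoch budget comes from the text.

```python
import math
import random


def make_toy_clips(n_clips=40, n_frames=8, dim=4, seed=0):
    """Synthetic 'clips': each clip is a list of frames, each frame a list of floats.
    Label 1 clips are shifted up, label 0 clips are shifted down (toy data only)."""
    rng = random.Random(seed)
    clips, labels = [], []
    for i in range(n_clips):
        y = i % 2
        shift = 1.0 if y else -1.0
        clip = [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n_frames)]
        clips.append(clip)
        labels.append(y)
    return clips, labels


def frozen_encoder(clip):
    """Hypothetical stand-in for a pretrained encoder: averages frames over time.
    A real setup would use a VideoMAE vision transformer here."""
    n = len(clip)
    return [sum(frame[i] for frame in clip) / n for i in range(len(clip[0]))]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune_head(features, labels, epochs=10, lr=0.5):
    """Train a logistic-regression head with per-sample SGD on fixed features.
    Returns the weights, the bias, and the mean loss recorded for each epoch."""
    w = [0.0] * len(features[0])
    b = 0.0
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            total += -(y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12))
            grad = p - y  # derivative of the cross-entropy loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
        history.append(total / len(features))
    return w, b, history


clips, labels = make_toy_clips()
features = [frozen_encoder(c) for c in clips]  # encoder is never updated
w, b, history = fine_tune_head(features, labels, epochs=10)
print("mean loss, first epoch:", round(history[0], 4))
print("mean loss, last epoch:", round(history[-1], 4))
```

The point of the sketch is only the division of labor: the representation comes from earlier training on a different domain, and a short training run on the target data adapts it to the new task.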
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
