AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder
Xingjian Diao, Ming Cheng, and Shitong Cheng

TL;DR
AV-MaskEnhancer is an audio-visual masked autoencoder that exploits the complementary nature of audio and video features to learn higher-quality video representations, especially in challenging scenarios such as low-resolution videos.
Contribution
The paper proposes AV-MaskEnhancer, a new method combining audio and visual data in masked autoencoders to enhance video representations beyond visual-only approaches.
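As a rough illustration of what combining audio and visual data in a masked autoencoder can look like, the sketch below masks a random subset of video and audio tokens, concatenates the visible ones into a joint encoder input, and scores reconstruction only on the masked positions. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the masking ratios (0.9 and 0.8), and the token shapes are all hypothetical and do not come from the paper.

```python
import numpy as np


def random_mask(tokens, mask_ratio, rng):
    """Keep a random subset of tokens.

    Returns the visible tokens and a boolean mask (True = masked).
    """
    num_tokens = tokens.shape[0]
    num_keep = int(round(num_tokens * (1.0 - mask_ratio)))
    perm = rng.permutation(num_tokens)
    keep_idx = np.sort(perm[:num_keep])  # preserve original token order
    mask = np.ones(num_tokens, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], mask


def build_encoder_input(video_tokens, audio_tokens,
                        video_ratio=0.9, audio_ratio=0.8, seed=0):
    """Mask each modality independently, then concatenate visible tokens.

    The ratios are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    v_visible, v_mask = random_mask(video_tokens, video_ratio, rng)
    a_visible, a_mask = random_mask(audio_tokens, audio_ratio, rng)
    joint = np.concatenate([v_visible, a_visible], axis=0)
    return joint, v_mask, a_mask


def masked_mse(pred, target, mask):
    """Mean squared error computed over masked tokens only."""
    per_token = ((pred - target) ** 2).mean(axis=-1)
    return float(per_token[mask].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    video = rng.normal(size=(100, 8))  # 100 hypothetical video patch tokens
    audio = rng.normal(size=(50, 8))   # 50 hypothetical spectrogram tokens
    joint, v_mask, a_mask = build_encoder_input(video, audio)
    print(joint.shape, int(v_mask.sum()), int(a_mask.sum()))
```

Because the visible tokens from both modalities share one sequence, an encoder that attends over `joint` can use audio tokens to help reconstruct masked video tokens, which is the complementary effect the paper targets.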
Findings
Achieves state-of-the-art top-1 accuracy of 98.8% on UCF101
Outperforms previous models in low-resolution and blurry video scenarios
Demonstrates the effectiveness of cross-modality learning in video representation
Abstract
Learning high-quality video representations has significant applications in computer vision and remains challenging. Previous work based on masked autoencoders, such as ImageMAE and VideoMAE, has proven the effectiveness of learning representations of images and videos through a reconstruction strategy in the visual modality. However, these models have inherent limitations, particularly in scenarios where extracting features solely from the visual modality is difficult, such as when the original videos are low-resolution or blurry. To address this, we propose AV-MaskEnhancer, which learns high-quality video representations by combining visual and audio information. Our approach addresses the challenge by demonstrating the complementary nature of audio and video features in cross-modality content. Moreover, our result of the video classification task on the UCF101 dataset…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Digital Media Forensic Detection
