AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder

Xingjian Diao, Ming Cheng, and Shitong Cheng

arXiv:2309.08738 · cs.CV · December 22, 2023 · 1 citation

TL;DR

AV-MaskEnhancer introduces an audio-visual masked autoencoder that exploits the complementary nature of audio and video features to learn higher-quality video representations, especially in challenging scenarios such as low-resolution videos.

Contribution

The paper proposes AV-MaskEnhancer, a new method combining audio and visual data in masked autoencoders to enhance video representations beyond visual-only approaches.

Findings

01

Achieves state-of-the-art top-1 accuracy of 98.8% on UCF101

02

Outperforms previous models in low-resolution and blurry video scenarios

03

Demonstrates the effectiveness of cross-modality learning in video representation

Abstract

Learning high-quality video representations has significant applications in computer vision and remains challenging. Previous work based on masked autoencoders, such as ImageMAE and VideoMAE, has proven the effectiveness of learning representations of images and videos through a reconstruction strategy in the visual modality. However, these models exhibit inherent limitations, particularly in scenarios where extracting features solely from the visual modality proves challenging, such as when the original videos are low-resolution or blurry. Based on this, we propose AV-MaskEnhancer for learning high-quality video representations by combining visual and audio information. Our approach addresses the challenge by demonstrating the complementary nature of audio and video features in cross-modality content. Moreover, our result of the video classification task on the UCF101 dataset…
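
The reconstruction strategy mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the per-modality masking ratios, and the token layout are all assumptions made for illustration. It only shows the general masked-autoencoder idea of hiding a random subset of tokens in each modality, passing the visible tokens from both modalities to a shared encoder, and keeping the hidden positions as reconstruction targets.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Split token positions into (visible, masked) index lists for one modality."""
    num_masked = int(num_tokens * mask_ratio)
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def build_encoder_input(video_tokens, audio_tokens,
                        video_ratio=0.75, audio_ratio=0.5, seed=0):
    """Hypothetical helper: mask each modality independently.

    The ratios are placeholders, not values taken from the paper.
    Returns the visible tokens (tagged with modality and position) that an
    encoder would see, plus the masked positions a decoder would reconstruct.
    """
    rng = random.Random(seed)
    v_visible, v_masked = random_mask(len(video_tokens), video_ratio, rng)
    a_visible, a_masked = random_mask(len(audio_tokens), audio_ratio, rng)

    # Visible tokens from both modalities form one joint encoder input.
    encoder_input = (
        [("video", i, video_tokens[i]) for i in v_visible]
        + [("audio", i, audio_tokens[i]) for i in a_visible]
    )
    targets = {"video": v_masked, "audio": a_masked}
    return encoder_input, targets


if __name__ == "__main__":
    video = [f"v{i}" for i in range(8)]   # stand-ins for video patch tokens
    audio = [f"a{i}" for i in range(4)]   # stand-ins for audio spectrogram tokens
    enc_in, tgt = build_encoder_input(video, audio)
    print(len(enc_in), tgt)
```

In this sketch, audio tokens remain available to the encoder even when most of the visual tokens are hidden, which is one way to picture how a second modality could supply information that degraded video frames lack.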

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Digital Media Forensic Detection