Self-supervised Contrastive Learning for Audio-Visual Action Recognition
Yang Liu, Ying Tan, Haoyuan Lan

TL;DR
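This paper introduces AVCL, an end-to-end self-supervised framework that exploits the correlation between audio and visual streams in unlabeled videos. It combines attention-based fusion (AMFM), co-correlation guided alignment (CGRA), and contrastive learning (SelfCL) to learn representations for action recognition, and reports better results than existing methods on Kinetics-Sounds32 and Kinetics-Sounds100.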
Contribution
The paper presents an end-to-end self-supervised learning framework with three modules: AMFM for audio-visual fusion, CGRA for cross-modal alignment, and SelfCL for contrastive learning. It also introduces a new audio-visual action recognition dataset, Kinetics-Sounds100.
Findings
AVCL outperforms state-of-the-art methods on Kinetics-Sounds32.
The proposed fusion (AMFM) and alignment (CGRA) modules are reported to fuse and align audio-visual data effectively.
AVCL demonstrates strong generalization on large-scale datasets.
Abstract
The underlying correlation between audio and visual modalities can be utilized to learn supervisory information for unlabeled videos. In this paper, we propose an end-to-end self-supervised framework, named Audio-Visual Contrastive Learning (AVCL), to learn discriminative audio-visual representations for action recognition. Specifically, we design an attention-based multi-modal fusion module (AMFM) to fuse audio and visual modalities. To align heterogeneous audio-visual modalities, we construct a novel co-correlation guided representation alignment module (CGRA). To learn supervisory information from unlabeled videos, we propose a novel self-supervised contrastive learning module (SelfCL). Furthermore, we build a new audio-visual action recognition dataset named Kinetics-Sounds100. Experimental results on the Kinetics-Sounds32 and Kinetics-Sounds100 datasets demonstrate the superiority of our…
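The abstract excerpt does not specify the SelfCL objective, so the following is only a generic illustration of cross-modal contrastive learning, not the authors' method. It sketches a symmetric InfoNCE loss in plain Python, assuming each video yields one audio embedding and one visual embedding, that embeddings from the same video form the positive pair, and that all other pairings in the batch act as negatives. The function name `info_nce` and the `temperature` value are hypothetical choices made for this sketch.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(audio, visual, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    audio[i] and visual[i] are assumed to come from the same video
    (positive pair); every other combination is treated as a negative.
    """
    a = [l2_normalize(x) for x in audio]
    v = [l2_normalize(x) for x in visual]
    n = len(a)

    # sim[i][j]: temperature-scaled cosine similarity of audio i and visual j
    sim = [[dot(a[i], v[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy(logits, target):
        # Numerically stable log-sum-exp minus the logit of the positive pair
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_z - logits[target]

    # Audio-to-visual: each row of sim is a classification over visual clips
    a2v = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # Visual-to-audio: each column of sim is a classification over audio clips
    v2a = sum(cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (a2v + v2a)


# Toy batch: two videos with 2-D embeddings (made-up values)
audio = [[1.0, 0.0], [0.0, 1.0]]
visual_aligned = [[1.0, 0.0], [0.0, 1.0]]
visual_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(audio, visual_aligned)
loss_swapped = info_nce(audio, visual_swapped)
```

In this toy batch the loss is small when each audio embedding matches its own visual embedding and large when the pairs are swapped, which is the behavior a contrastive objective relies on to learn from unlabeled audio-visual correspondence.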
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Video Surveillance and Tracking Methods
Methods: ALIGN · Contrastive Learning
