Self-supervised Contrastive Learning for Audio-Visual Action Recognition

Yang Liu; Ying Tan; Haoyuan Lan

arXiv:2204.13386 · cs.CV · March 21, 2023

Open Access

TL;DR

This paper introduces AVCL, a self-supervised framework that leverages audio-visual correlations and novel modules to improve action recognition in unlabeled videos, outperforming existing methods on large-scale datasets.

Contribution

The paper presents a new end-to-end self-supervised learning framework with innovative modules for audio-visual fusion and alignment, along with a new dataset for action recognition.

Findings

01. AVCL outperforms state-of-the-art methods on Kinetics-Sounds32.

02. The proposed modules effectively fuse and align audio-visual data.

03. AVCL demonstrates strong generalization on large-scale datasets.

Abstract

The underlying correlation between audio and visual modalities can be utilized to learn supervised information for unlabeled videos. In this paper, we propose an end-to-end self-supervised framework, named Audio-Visual Contrastive Learning (AVCL), to learn discriminative audio-visual representations for action recognition. Specifically, we design an attention-based multi-modal fusion module (AMFM) to fuse audio and visual modalities. To align heterogeneous audio-visual modalities, we construct a novel co-correlation guided representation alignment module (CGRA). To learn supervised information from unlabeled videos, we propose a novel self-supervised contrastive learning module (SelfCL). Furthermore, we build a new audio-visual action recognition dataset named Kinetics-Sounds100. Experimental results on the Kinetics-Sounds32 and Kinetics-Sounds100 datasets demonstrate the superiority of our…
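
For readers unfamiliar with contrastive learning across modalities, the general idea can be sketched with a generic symmetric InfoNCE loss. This is not the paper's SelfCL module, whose exact formulation the abstract does not give; the function name `audio_visual_info_nce`, the `temperature` value, and the toy embeddings are assumptions made purely for illustration. The audio and visual clips at the same index are treated as a positive pair, and every other pairing in the batch as a negative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def audio_visual_info_nce(audio, visual, temperature=0.1):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's SelfCL).

    audio[i] and visual[i] are assumed to be embeddings of the same clip.
    """
    a = [l2_normalize(v) for v in audio]
    v = [l2_normalize(x) for x in visual]
    n = len(a)
    # sim[i][j]: scaled cosine similarity between audio clip i and visual clip j.
    sim = [[dot(a[i], v[j]) / temperature for j in range(n)] for i in range(n)]
    # Audio-to-visual direction: each row should peak on its diagonal entry.
    a2v = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Visual-to-audio direction: each column should peak on its diagonal entry.
    v2a = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (a2v + v2a)


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    aligned = audio_visual_info_nce(audio, [[1.0, 0.0], [0.0, 1.0]])
    swapped = audio_visual_info_nce(audio, [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
```

In this toy setup the loss is small when each audio embedding matches its own clip's visual embedding and large when the pairs are swapped, which is the signal that lets a model learn from unlabeled videos.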

Taxonomy

Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Video Surveillance and Tracking Methods

Methods: ALIGN · Contrastive Learning