Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

Ying Cheng; Ruize Wang; Zhihao Pan; Rui Feng; Yuejie Zhang

arXiv:2008.05789 · cs.MM · August 19, 2020

TL;DR

This paper introduces a self-supervised co-attention network that learns cross-modal audio-visual representations from unlabelled videos, improving performance on synchronization, localization, and recognition tasks by focusing on correlated regions.

Contribution

It proposes a novel co-attention mechanism for self-supervised audio-visual learning, enhancing cross-modal feature extraction and transferability to downstream tasks.

Findings

01 Achieves state-of-the-art results on audio-visual synchronization.

02 Effective in sound source localization and action recognition.

03 Handles scenes with multiple sound sources effectively.

Abstract

When watching videos, the occurrence of a visual event is often accompanied by an audio event, e.g., the voice that accompanies lip motion or the music of instruments being played. There is an underlying correlation between audio and visual events, which can be used as a free supervisory signal to train a neural network by solving the pretext task of audio-visual synchronization. In this paper, we propose a novel self-supervised framework with a co-attention mechanism to learn generic cross-modal representations from unlabelled videos in the wild, which in turn benefit downstream tasks. Specifically, we explore three different co-attention modules that focus on discriminative visual regions correlated with the sounds and introduce interactions between the two modalities. Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing…
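To make the idea of attending to sound-correlated visual regions concrete, here is a minimal sketch of one direction of cross-modal attention, in which an audio feature vector is the query over a set of visual region features. It uses generic scaled dot-product attention as a stand-in, not the paper's three co-attention modules, whose details are not given in this abstract. The names `audio_guided_attention`, `audio`, and `regions`, the feature sizes, and the toy values are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def audio_guided_attention(audio_vec, visual_regions):
    """Hypothetical helper: weight visual regions by similarity to the audio.

    Generic scaled dot-product attention (an assumption, not the paper's
    module). Returns the per-region weights and the attended visual feature.
    """
    scale = math.sqrt(len(audio_vec))
    weights = softmax([dot(audio_vec, r) / scale for r in visual_regions])
    dim = len(visual_regions[0])
    attended = [
        sum(w * r[i] for w, r in zip(weights, visual_regions))
        for i in range(dim)
    ]
    return weights, attended

# Toy features: one audio vector and four visual regions (values made up).
audio = [1.0, 0.0, 1.0]
regions = [
    [0.9, 0.1, 1.1],    # region whose feature resembles the audio
    [-1.0, 0.5, -0.8],
    [0.0, 1.0, 0.0],
    [0.1, -0.2, 0.0],
]

weights, attended = audio_guided_attention(audio, regions)
best_region = max(range(len(weights)), key=weights.__getitem__)
print("attention weights:", [round(w, 3) for w in weights])
print("most attended region:", best_region)
```

In a synchronization pretext task, an attended visual feature like this could be compared against the audio feature to decide whether a clip pair is in sync. The attention weights themselves give a coarse map of where the sound is likely coming from.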
