Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning

Shaobo Min; Qi Dai; Hongtao Xie; Chuang Gan; Yongdong Zhang; Jingdong Wang

arXiv:2106.06939·cs.CV·June 15, 2021·5 cites

TL;DR

This paper introduces Cross-Modal Attention Consistency (CMAC), a novel unsupervised learning method that aligns visual and audio regional attention to improve video-audio representation learning.

Contribution

It proposes a new pretext task, CMAC, that enforces bidirectional local correspondence between visual and audio signals within a contrastive learning framework.

Findings

01. Improves state-of-the-art performance on multiple benchmarks.

02. Effectively aligns visual and audio regional attention.

03. Enhances unsupervised video-audio representation learning.

Abstract

Cross-modal correlation provides inherent supervision for unsupervised video representation learning. Existing methods focus on distinguishing different video clips by their visual and audio representations. Human visual perception can attend to regions where sounds are made, and auditory perception can likewise ground the frequencies of sounding objects, a property we call bidirectional local correspondence. Such supervision is intuitive but not well explored in the contrastive learning framework. This paper introduces a pretext task, Cross-Modal Attention Consistency (CMAC), for exploring the bidirectional local correspondence property. The CMAC approach aims to align the regional attention generated purely from the visual signal with the target attention generated under the guidance of the acoustic signal, and to perform a similar alignment for frequency grounding on the acoustic attention.…
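The alignment idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the mean-pooled query, the dot-product scoring, and the KL-divergence loss are all assumptions chosen for illustration, since the abstract does not specify how the attentions are computed or compared.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def visual_only_attention(regions):
    """Hypothetical attention over regions computed from visual features alone.

    regions: (R, D) array of R regional visual features.
    Assumption: the query is the mean-pooled visual feature.
    """
    query = regions.mean(axis=0)
    return softmax(regions @ query / np.sqrt(regions.shape[1]))

def audio_guided_attention(regions, audio):
    """Hypothetical target attention over regions, guided by an audio embedding.

    audio: (D,) global audio feature used as the query (an assumption).
    """
    return softmax(regions @ audio / np.sqrt(regions.shape[1]))

def attention_consistency_loss(pred, target, eps=1e-8):
    """KL(target || pred); the choice of KL divergence is an assumption."""
    return float(np.sum(target * (np.log(target + eps) - np.log(pred + eps))))

# Toy example with random features: 49 regions (e.g. a 7x7 grid), 128 dims.
rng = np.random.default_rng(0)
regions = rng.normal(size=(49, 128))
audio = rng.normal(size=128)

pred = visual_only_attention(regions)
target = audio_guided_attention(regions, audio)
loss = attention_consistency_loss(pred, target)
```

Under the same assumptions, the frequency-grounding direction would reuse these functions with audio frequency bins in place of `regions` and a global visual embedding in place of `audio`.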

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization

Methods: Contrastive Learning