Robust Audio-Visual Instance Discrimination via Active Contrastive Set Mining

Hanyu Xuan; Yihong Xu; Shuo Chen; Zhiliang Wu; Jian Yang; Yan Yan; Xavier Alameda-Pineda

arXiv:2204.12366·cs.MM·April 27, 2022

Open Access

TL;DR

This paper introduces Active Contrastive Set Mining (ACSM) to enhance audio-visual instance discrimination by mining more informative negatives, significantly improving action and sound recognition performance across multiple datasets.

Contribution

The paper proposes a novel ACSM approach that effectively mines informative and diverse negatives, addressing the limitations of random sampling in AVID.

Findings

01. Significant performance improvements on multiple datasets
02. Enhanced robustness of AVID models
03. Effective integration of semantically-aware hard-sample mining

Abstract

The recent success of audio-visual representation learning can be largely attributed to the pervasive property of audio-visual synchronization, which can be used as self-annotated supervision. As a state-of-the-art solution, Audio-Visual Instance Discrimination (AVID) extends instance discrimination to the audio-visual realm. Existing AVID methods construct the contrastive set by random sampling, based on the assumption that the audio and visual clips from all other videos are not semantically related. We argue that this assumption is too coarse, since the resulting contrastive sets contain a large number of faulty negatives. In this paper, we overcome this limitation by proposing Active Contrastive Set Mining (ACSM), a novel method that aims to mine contrastive sets with informative and diverse negatives for robust AVID. Moreover, we integrate a semantically-aware hard-sample mining strategy…
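
To make the baseline concrete, the following is a minimal sketch of cross-modal instance discrimination with a randomly sampled contrastive set, as the abstract describes for existing AVID methods. It is not the authors' ACSM procedure, which the abstract does not specify. The function names (`sample_contrastive_set`, `cross_modal_nce`), the InfoNCE-style loss form, the toy embeddings, and the temperature value are all assumptions chosen for illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def sample_contrastive_set(anchor_idx, num_videos, k, rng):
    """Randomly pick k negative indices from all other videos.

    This encodes the baseline assumption that every other video is
    semantically unrelated to the anchor (hypothetical helper).
    """
    candidates = [i for i in range(num_videos) if i != anchor_idx]
    return rng.sample(candidates, k)


def cross_modal_nce(video_emb, audio_bank, anchor_idx, neg_idx, temperature=0.07):
    """InfoNCE-style loss: the video anchor should match its own audio
    (positive) better than the sampled audio negatives.
    The temperature is an arbitrary illustrative value.
    """
    pos = dot(video_emb, audio_bank[anchor_idx]) / temperature
    logits = [pos] + [dot(video_emb, audio_bank[j]) / temperature for j in neg_idx]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos


# Toy data: 8 videos with random unit-norm audio embeddings.
rng = random.Random(0)
num_videos, dim, k = 8, 4, 3
audio_bank = [normalize([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(num_videos)]
video_emb = audio_bank[0]  # toy case: anchor video perfectly aligned with its audio

neg_idx = sample_contrastive_set(0, num_videos, k, rng)
loss_clean = cross_modal_nce(video_emb, audio_bank, 0, neg_idx)

# Simulate a faulty negative: one sampled "negative" has the same
# audio content as the positive (e.g. another video of the same event).
faulty_bank = list(audio_bank)
faulty_bank[neg_idx[0]] = audio_bank[0]
loss_faulty = cross_modal_nce(video_emb, faulty_bank, 0, neg_idx)

print(loss_clean, loss_faulty)
```

In this toy setup the loss rises when a sampled negative is semantically identical to the positive, even though the anchor embedding is already ideal. That is the kind of penalty from faulty negatives that the abstract argues random sampling introduces.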

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Speech and Audio Processing