Robust Audio-Visual Instance Discrimination

Pedro Morgado; Ishan Misra; Nuno Vasconcelos

arXiv:2103.15916 · cs.CV · March 31, 2021

TL;DR

This paper introduces a robust self-supervised learning method for audio-visual representations that mitigates noise from faulty positives and negatives, improving action recognition and transfer learning.

Contribution

The paper proposes a weighted contrastive loss and a soft target distribution for instance discrimination, addressing the noise that faulty positives and faulty negatives introduce in audio-visual self-supervised learning.

Findings

01. Improved action recognition accuracy
02. Enhanced transfer learning performance
03. Effective noise mitigation in contrastive learning

Abstract

We present a self-supervised learning method to learn audio and video representations. Prior work uses the natural correspondence between audio and video to define a standard cross-modal instance discrimination task, where a model is trained to match representations from the two modalities. However, the standard approach introduces two sources of training noise. First, audio-visual correspondences often produce faulty positives since the audio and video signals can be uninformative of each other. To limit the detrimental impact of faulty positives, we optimize a weighted contrastive learning loss, which down-weighs their contribution to the overall loss. Second, since self-supervised contrastive learning relies on random sampling of negative instances, instances that are semantically similar to the base instance can be used as faulty negatives. To alleviate the impact of faulty…
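The weighted loss described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not say how the per-instance weights or the soft targets are obtained, so both are supplied by the caller here. The function name `weighted_cross_modal_loss`, the temperature `tau=0.07`, and the averaging of the two matching directions are all assumptions made for illustration.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def weighted_cross_modal_loss(video, audio, weights, targets=None, tau=0.07):
    """Weighted cross-modal instance discrimination loss (illustrative sketch).

    video, audio: (N, D) embeddings; row i of each comes from the same clip.
    weights:      (N,) per-instance weights; a small weight down-weights a
                  suspected faulty positive (how to choose it is NOT shown here).
    targets:      optional (N, N) row-stochastic soft targets; defaults to the
                  identity, i.e. standard one-hot instance discrimination.
    tau:          softmax temperature (assumed value).
    """
    v = video / np.linalg.norm(video, axis=1, keepdims=True)
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    logits = v @ a.T / tau                      # logits[i, j] = sim(video_i, audio_j)
    if targets is None:
        targets = np.eye(len(v))
    # Cross-entropy in both directions: video -> audio and audio -> video.
    loss_va = -(targets * log_softmax(logits, axis=1)).sum(axis=1)
    loss_av = -(targets.T * log_softmax(logits.T, axis=1)).sum(axis=1)
    per_instance = 0.5 * (loss_va + loss_av)
    # Weighted average, so low-weight instances contribute less.
    return float((weights * per_instance).sum() / weights.sum())

# Toy batch: instance 0 is a "faulty positive" whose audio matches clip 1 instead.
video = np.eye(4)
audio = np.eye(4)
audio[0] = audio[1]

uniform = weighted_cross_modal_loss(video, audio, np.ones(4))
robust = weighted_cross_modal_loss(video, audio, np.array([0.1, 1.0, 1.0, 1.0]))
print(uniform, robust)
```

In this toy batch the mismatched pair dominates the uniformly weighted loss, and reducing its weight shrinks its contribution. Passing a non-identity `targets` matrix is one way to express a soft target distribution of the kind mentioned in the Contribution summary, although the construction of those targets is outside what this page describes.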

Taxonomy

Methods: Contrastive Learning