Robust Audio-Visual Instance Discrimination
Pedro Morgado, Ishan Misra, Nuno Vasconcelos

TL;DR
This paper introduces a robust self-supervised learning method for audio-visual representations that mitigates noise from faulty positives and negatives, improving action recognition and transfer learning.
Contribution
It proposes a weighted contrastive loss that down-weights faulty positives, and a soft target distribution for instance discrimination that accounts for faulty negatives, addressing two sources of noise in audio-visual self-supervised learning.
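To make the idea of a soft target distribution concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names `mix_targets` and `soft_target_cross_entropy`, the `mix` coefficient, and the use of a softmax over a given instance-similarity matrix are all assumptions made for illustration. The sketch only shows how a one-hot instance-discrimination target could be blended with a similarity-based distribution so that semantically similar instances are not treated as pure negatives.

```python
import numpy as np

def mix_targets(similarity, mix=0.5):
    """Blend one-hot targets with a softmax over instance similarities.

    `similarity` is an (N, N) matrix of instance-to-instance scores and
    `mix` is a hypothetical blending coefficient in [0, 1].
    """
    n = similarity.shape[0]
    one_hot = np.eye(n)
    # Numerically stable row-wise softmax of the similarity scores.
    shifted = similarity - similarity.max(axis=1, keepdims=True)
    soft = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return (1.0 - mix) * one_hot + mix * soft

def soft_target_cross_entropy(logits, targets):
    """Cross-entropy between row-wise softmax(logits) and soft targets."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(targets * log_prob).sum(axis=1).mean())
```

With `mix=0` the targets reduce to the usual one-hot instance labels, so the loss falls back to standard instance discrimination; larger values move more target mass onto instances the similarity matrix marks as related.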
Findings
Improved action recognition accuracy
Enhanced transfer learning performance
Effective noise mitigation in contrastive learning
Abstract
We present a self-supervised learning method to learn audio and video representations. Prior work uses the natural correspondence between audio and video to define a standard cross-modal instance discrimination task, where a model is trained to match representations from the two modalities. However, the standard approach introduces two sources of training noise. First, audio-visual correspondences often produce faulty positives since the audio and video signals can be uninformative of each other. To limit the detrimental impact of faulty positives, we optimize a weighted contrastive learning loss, which down-weighs their contribution to the overall loss. Second, since self-supervised contrastive learning relies on random sampling of negative instances, instances that are semantically similar to the base instance can be used as faulty negatives. To alleviate the impact of faulty…
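As a rough illustration of the weighted contrastive loss described in the abstract, the sketch below computes a video-to-audio instance discrimination loss in which each instance's term is scaled by a weight. It is a sketch under stated assumptions, not the paper's code: the function name `weighted_cross_modal_nce`, the temperature value, and the fact that weights are simply passed in (rather than estimated during training) are all choices made here for illustration.

```python
import numpy as np

def weighted_cross_modal_nce(video_emb, audio_emb, weights, temperature=0.07):
    """Weighted video-to-audio contrastive loss over a batch of N pairs.

    Row i of `video_emb` and row i of `audio_emb` form a positive pair;
    all other audio rows act as negatives. `weights[i]` scales the loss
    term of pair i, so a suspected faulty positive can be down-weighted.
    """
    # L2-normalize so that dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature  # shape (N, N)
    # Numerically stable log-softmax over audio candidates for each video.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_instance = -np.diag(log_prob)  # loss of each positive pair
    w = np.asarray(weights, dtype=float)
    # Weighted average: low-weight pairs contribute less to the total.
    return float((w * per_instance).sum() / w.sum())
```

With uniform weights this reduces to the standard cross-modal instance discrimination loss; setting a pair's weight close to zero removes most of its contribution as a positive while still leaving its audio available as a negative for the other instances.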
Taxonomy
Methods: Contrastive Learning
