Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning

Nikhil Singh; Chih-Wei Wu; Iroro Orife; Mahdi Kalayeh

arXiv:2304.05600 · cs.SD · June 11, 2024 · 1 citation

Open Access

TL;DR

This paper explores how using dubbed audio tracks as counterfactual pairs in contrastive learning improves audiovisual representations, yielding better performance on a range of downstream auditory and audiovisual tasks without substantially affecting linguistic task performance overall.

Contribution

The paper introduces an approach that uses dubbed audio tracks to augment audiovisual contrastive learning, so that scene-level representations are less sensitive to variation in speech.

Findings

01. Improved performance on audiovisual tasks with dubbed audio augmentation

02. Enhanced robustness of audiovisual representations across diverse scenarios

03. Dubbed audio does not significantly impact linguistic task performance

Abstract

Audiovisual representation learning typically relies on the correspondence between sight and sound. However, there are often multiple audio tracks that can correspond with a visual scene. Consider, for example, different conversations on the same crowded street. The effect of such counterfactual pairs on audiovisual representation learning has not been previously explored. To investigate this, we use dubbed versions of movies and television shows to augment cross-modal contrastive learning. Our approach learns to represent alternate audio tracks, differing only in speech, similarly to the same video. Our results, from a comprehensive set of experiments investigating different training strategies, show this general approach improves performance on a range of downstream auditory and audiovisual tasks, without majorly affecting linguistic task performance overall. These findings highlight…
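The idea of treating an original and a dubbed audio track as two positives for the same video can be illustrated with a minimal sketch. This is not the paper's objective or code: it assumes a plain video-to-audio InfoNCE loss averaged over both tracks, and the function names (`info_nce`, `dubbed_contrastive_loss`), the temperature value, and the toy embeddings are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, candidates, temperature=0.1):
    """Mean InfoNCE loss; candidates[i] is the positive for anchors[i],
    and every other candidate in the batch acts as a negative."""
    anchors = [l2_normalize(a) for a in anchors]
    candidates = [l2_normalize(c) for c in candidates]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def dubbed_contrastive_loss(video, audio_original, audio_dubbed, temperature=0.1):
    """Hypothetical objective: average the video-to-audio loss over the
    original track and the dubbed track, so both are pulled toward the
    embedding of the same video clip."""
    loss_original = info_nce(video, audio_original, temperature)
    loss_dubbed = info_nce(video, audio_dubbed, temperature)
    return 0.5 * (loss_original + loss_dubbed)


# Toy 2-D embeddings for a batch of two clips (illustrative values only).
video = [[1.0, 0.0], [0.0, 1.0]]
audio_original = [[0.9, 0.1], [0.1, 0.9]]
audio_dubbed = [[0.8, 0.2], [0.2, 0.8]]

aligned = dubbed_contrastive_loss(video, audio_original, audio_dubbed)
# Swapping the dubbed tracks pairs each video with the wrong dub.
mismatched = dubbed_contrastive_loss(video, audio_original, audio_dubbed[::-1])
print(aligned < mismatched)
```

In this toy setup the loss is lower when each dubbed track is paired with its own video than when the dubs are swapped, which is the behavior a loss of this kind is meant to reward. A symmetric audio-to-video term could be added in the same way.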

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Subtitles and Audiovisual Media