Self-Supervised Visual Acoustic Matching

Arjun Somayazulu; Changan Chen; Kristen Grauman

arXiv:2307.15064 · cs.MM · November 27, 2023 · 2 cites

TL;DR

This paper introduces a self-supervised method for visual acoustic matching that does not require paired training data, enabling more flexible and diverse training and outperforming existing methods on various real-world datasets.

Contribution

The paper presents a self-supervised approach to visual acoustic matching that combines a conditional GAN framework with a new metric quantifying the residual acoustic information left in the audio.

Findings

01. Outperforms the state of the art on multiple datasets

02. Works with both real-world and simulated data

03. Effectively disentangles room acoustics from audio

Abstract

Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target environments, but this limits the diversity of training data or requires the use of simulated data or heuristics to create paired samples. We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio -- without acoustically mismatched source audio for reference. Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric that quantifies the level of residual acoustic information in the de-biased audio. Training with either in-the-wild web data or simulated data, we demonstrate it…

Videos

Self-Supervised Visual Acoustic Matching · SlidesLive

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training