Visual Acoustic Matching

Changan Chen; Ruohan Gao; Paul Calamia; Kristen Grauman

arXiv:2202.06875 · cs.CV · June 15, 2022

TL;DR

This paper presents a new task called visual acoustic matching, where audio is transformed to match the acoustics of a target environment using visual cues, and introduces a cross-modal transformer model trained with self-supervision.

Contribution

The paper proposes the first approach to visual acoustic matching using a cross-modal transformer and self-supervised learning from in-the-wild videos.

Findings

01 Outperforms traditional acoustic matching methods.

02 Successfully translates speech to various real-world environments.

03 Uses self-supervised training to learn from unlabeled videos.

Abstract

We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. In addition, we devise a self-supervised training objective that can learn acoustic matching from in-the-wild Web videos, despite their lack of acoustically mismatched audio. We demonstrate that our approach successfully translates human speech to a variety of real-world environments depicted in images, outperforming both traditional acoustic matching and more heavily…
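The audio-visual attention mentioned in the abstract can be illustrated with a generic cross-attention step, in which audio frames act as queries over image-region features. This is a minimal sketch and not the authors' architecture: it assumes a single attention head, random untrained weights, and the hypothetical names `cross_modal_attention`, `audio_tokens`, and `visual_tokens`; the token counts and feature size are arbitrary.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(audio_tokens, visual_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: audio queries attend to visual keys/values.

    audio_tokens:  (T, d) array, one row per audio frame (hypothetical features)
    visual_tokens: (R, d) array, one row per image region (hypothetical features)
    """
    q = audio_tokens @ w_q                     # (T, d) queries from audio
    k = visual_tokens @ w_k                    # (R, d) keys from the image
    v = visual_tokens @ w_v                    # (R, d) values from the image
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T, R) scaled dot products
    weights = softmax(scores, axis=-1)         # each audio frame's weights sum to 1
    fused = audio_tokens + weights @ v         # residual connection keeps audio content
    return fused, weights


# Illustrative sizes only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
T, R, d = 50, 49, 32
audio_tokens = rng.normal(size=(T, d))
visual_tokens = rng.normal(size=(R, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = cross_modal_attention(audio_tokens, visual_tokens, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` shows how strongly one audio frame draws on each image region, which is the sense in which visual properties are injected into the audio representation. A full model would add learned weights, multiple heads and layers, and a decoder that synthesizes the output waveform.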

Code & Models

Repositories

see2sound/see2sound
jax

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Video Analysis and Summarization