Visual Acoustic Matching
Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman

TL;DR
This paper presents a new task called visual acoustic matching, where audio is transformed to match the acoustics of a target environment using visual cues, and introduces a cross-modal transformer model trained with self-supervision.
Contribution
The paper proposes the first approach to visual acoustic matching using a cross-modal transformer and self-supervised learning from in-the-wild videos.
Findings
Outperforms traditional acoustic matching methods.
Successfully translates speech to various real-world environments.
Uses self-supervised training to learn from unlabeled videos.
Abstract
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. In addition, we devise a self-supervised training objective that can learn acoustic matching from in-the-wild Web videos, despite their lack of acoustically mismatched audio. We demonstrate that our approach successfully translates human speech to a variety of real-world environments depicted in images, outperforming both traditional acoustic matching and more heavily…
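The audio-visual attention described in the abstract can be illustrated with a minimal sketch. This is not the authors' model. It assumes that audio frames act as queries and visual region features act as keys and values, both already embedded in the same dimension, and it omits learned projections, multiple heads, and waveform synthesis. The names `softmax` and `cross_attention` are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(audio_feats, visual_feats):
    """Assumed setup: each audio frame (query) attends over all visual
    tokens (keys and values). Returns the audio features with the
    attended visual context added as a residual."""
    dim = len(visual_feats[0])
    fused = []
    for query in audio_feats:
        # Scaled dot-product score between this audio frame and each visual token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visual_feats
        ]
        weights = softmax(scores)
        # Weighted average of the visual tokens for this audio frame.
        context = [
            sum(w * value[i] for w, value in zip(weights, visual_feats))
            for i in range(dim)
        ]
        # Residual connection: inject the visual context into the audio feature.
        fused.append([q + c for q, c in zip(query, context)])
    return fused
```

Calling `cross_attention(audio_feats, visual_feats)` with a list of audio frame vectors and a list of visual token vectors returns one fused vector per audio frame. A full model would pass these fused features to a decoder that generates the output audio.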
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Video Analysis and Summarization
