Visual to Sound: Generating Natural Sound for Videos in the Wild
Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, Tamara L. Berg

TL;DR
This paper introduces a learning-based method to generate realistic sounds from video frames, aiming to enhance virtual reality experiences and accessibility for visually impaired users by synchronizing audio with visual content.
Contribution
It presents the first approach to generate raw waveform sounds directly from videos, demonstrating promising realism and synchronization in diverse scenarios.
Findings
Generated sounds are fairly realistic.
Sounds show good temporal synchronization.
Method performs well across various sound types.
Abstract
As two of the five traditional human senses (sight, hearing, taste, smell, and touch), vision and sound are basic sources through which humans understand the world. Often correlated during natural events, these two modalities combine to jointly affect human perception. In this paper, we pose the task of generating sound given visual input. Such capabilities could help enable applications in virtual reality (generating sound for virtual scenes automatically) or provide additional accessibility to images or videos for people with visual impairments. As a first step in this direction, we apply learning-based methods to generate raw waveform samples given input video frames. We evaluate our models on a dataset of videos containing a variety of sounds (such as ambient sounds and sounds from people/animals). Our experiments show that the generated sounds are fairly realistic and have good…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMusic and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization
