EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos
Aashish Rai, Srinath Sridhar

TL;DR
EgoSonics is a novel method that generates synchronized, semantically meaningful audio for silent egocentric videos, enabling new applications in VR and data augmentation.
Contribution
EgoSonics combines latent diffusion models with SyncroNet, a module built on top of ControlNet, to generate synchronized audio from silent egocentric videos. It targets the restriction of prior work to narrow domains such as speech, music, or impact sounds.
Findings
Outperforms existing methods in audio quality
Achieves better synchronization in generated audio
Enhances video summarization tasks
Abstract
We introduce EgoSonics, a method to generate semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos. Generating audio for silent egocentric videos could open new applications in virtual reality, assistive technologies, or for augmenting existing datasets. Existing work has been limited to domains like speech, music, or impact sounds and cannot capture the broad range of audio frequencies found in egocentric videos. EgoSonics addresses these limitations by building on the strengths of latent diffusion models for conditioned audio synthesis. We first encode and process paired audio-video data to make them suitable for generation. The encoded data is then used to train a model that can generate an audio track that captures the semantics of the input video. Our proposed SyncroNet builds on top of ControlNet to provide control signals that enable…
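The abstract is truncated and does not describe SyncroNet's internals, so the following is only a minimal sketch of the general ControlNet-style mechanism it says it builds on: a control branch whose output passes through a zero-initialized 1x1 convolution before being added to the backbone's features. All names here (`make_zero_conv`, `apply_conv1x1`, `inject_control`) and the toy feature shapes are hypothetical and are not taken from the paper. The sketch shows why zero initialization leaves the backbone's output unchanged at the start of training.

```python
import random


def make_zero_conv(channels):
    """Return (weight, bias) for a 1x1 convolution initialized to all zeros."""
    weight = [[0.0] * channels for _ in range(channels)]
    bias = [0.0] * channels
    return weight, bias


def apply_conv1x1(params, features):
    """Apply a per-time-step linear map over channels (a 1x1 convolution)."""
    weight, bias = params
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in features
    ]


def inject_control(backbone_features, control_features, zero_conv):
    """Add the projected control signal to the backbone features."""
    projected = apply_conv1x1(zero_conv, control_features)
    return [
        [h + c for h, c in zip(h_frame, c_frame)]
        for h_frame, c_frame in zip(backbone_features, projected)
    ]


# Toy features: T time steps, C channels (hypothetical sizes).
random.seed(0)
T, C = 4, 3
backbone = [[random.gauss(0, 1) for _ in range(C)] for _ in range(T)]
control = [[random.gauss(0, 1) for _ in range(C)] for _ in range(T)]

zero_conv = make_zero_conv(C)

# At initialization the control branch contributes nothing.
out_init = inject_control(backbone, control, zero_conv)

# Simulate a training update that moves one weight away from zero.
zero_conv[0][0][0] = 0.5
out_trained = inject_control(backbone, control, zero_conv)
```

In this sketch, `out_init` equals `backbone` exactly. Once a weight becomes nonzero, only the affected channel starts to carry the control signal. The usual motivation for this design is that conditioning can be added to a pretrained generator without disturbing its behaviour at the start of fine-tuning.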
Taxonomy
Topics: Multimedia Communication and Technology · Music Technology and Sound Studies
Methods: Diffusion
