Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Changan Chen; Puyuan Peng; Ami Baid; Zihui Xue; Wei-Ning Hsu; David Harwath; Kristen Grauman

arXiv:2406.09272 · cs.CV · July 26, 2024

TL;DR

This paper introduces AV-LDM, a novel ambient-aware model for generating realistic action sounds from egocentric videos, effectively disentangling foreground and background sounds and enabling controllable, semantically aligned audio synthesis.

Contribution

The paper presents AV-LDM, the first model to focus on faithful video-to-audio generation from uncurated in-the-wild videos, with a new audio-conditioning mechanism and retrieval-augmented generation.

Findings

01 Outperforms existing methods in generating action sounds.
02 Enables controllable ambient sound generation.
03 Shows potential for application to computer graphics clips.

Abstract

Generating realistic audio for human actions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets, Ego4D and…
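
The abstract does not spell out how the retrieval step works. As a minimal sketch of one plausible reading, assume the silent video is mapped to an embedding and the closest audio clip from a training bank is retrieved by cosine similarity to serve as the audio condition. Everything here is hypothetical and for illustration only: the function names, the bank layout, and the toy 2-D embeddings are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_condition(
    query: Sequence[float],
    bank: List[Tuple[str, Sequence[float]]],
) -> str:
    """Return the id of the bank clip whose embedding is most similar to the query.

    Assumption: `bank` holds (clip_id, embedding) pairs for training audio clips,
    and `query` is an embedding of the silent test video in the same space.
    """
    if not bank:
        raise ValueError("retrieval bank is empty")
    best_id, _ = max(bank, key=lambda item: cosine(query, item[1]))
    return best_id


# Toy bank with hand-written 2-D embeddings (hypothetical values).
bank = [
    ("kitchen_ambience", [1.0, 0.0]),
    ("street_ambience", [0.0, 1.0]),
]
chosen = retrieve_condition([0.9, 0.1], bank)
```

In a design like this, the retrieved clip would supply the ambient character of the output while the video drives the action sounds, which is one way the controllable ambient generation described above could be realized.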

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Video Analysis and Summarization