Read, Watch and Scream! Sound Generation from Text and Video

Yujin Jeong; Yunji Kim; Sanghyuk Chun; Jiyoung Lee

arXiv:2407.05551 · cs.CV · December 30, 2024

TL;DR

This paper introduces ReWaS, a multimodal generative approach that combines video and text cues to generate controllable, high-quality audio, improving flexibility and efficiency over existing methods.

Contribution

The method uses video-derived structural cues (sound energy) as a conditional control for a text-to-audio diffusion model, so that text prompts supply the content while video guides the timing and intensity of the generated audio.

Findings

01 Outperforms existing models in audio quality and controllability.

02 Enables user adjustments of energy, environment, and sound sources.

03 Demonstrates improved training efficiency with large triplet data.

Abstract

Despite the impressive progress of multimodal generative models, video-to-audio generation still suffers from limited performance and offers little flexibility to prioritize sound synthesis for specific objects within the scene. Conversely, text-to-audio generation methods generate high-quality audio but pose challenges in ensuring comprehensive scene depiction and time-varying control. To tackle these challenges, we propose a novel video-and-text-to-audio generation method, called ReWaS, where video serves as a conditional control for a text-to-audio generation model. Specifically, our method estimates the structural information of sound (namely, energy) from the video while receiving key content cues from a user prompt. We employ a well-performing text-to-audio model to consolidate the video control, which is much more efficient for training multimodal diffusion models with massive…
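To make the notion of sound "energy" as a structural signal concrete, the sketch below computes a frame-wise energy curve from a waveform. The abstract does not say how energy is defined or estimated, so this is only an assumption for illustration: it uses short-time RMS amplitude as a stand-in, and the names `frame_energy` and `normalize` as well as the frame and hop sizes are hypothetical, not taken from the paper or its repository.

```python
import numpy as np


def frame_energy(wave: np.ndarray, frame_len: int = 1024, hop: int = 256) -> np.ndarray:
    """Frame-wise RMS energy of a mono waveform.

    Assumption: RMS is used here as an illustrative proxy for the
    "energy" control signal; the paper may define it differently.
    """
    if wave.ndim != 1:
        raise ValueError("expected a mono (1-D) waveform")
    if len(wave) < frame_len:
        # Zero-pad short inputs so that at least one frame exists.
        wave = np.pad(wave, (0, frame_len - len(wave)))
    n_frames = 1 + (len(wave) - frame_len) // hop
    # Index matrix of shape (n_frames, frame_len): row i covers
    # samples [i * hop, i * hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx]
    return np.sqrt(np.mean(frames ** 2, axis=1))


def normalize(curve: np.ndarray) -> np.ndarray:
    """Min-max scale a curve to [0, 1]; a constant curve maps to zeros."""
    span = curve.max() - curve.min()
    if span == 0:
        return np.zeros_like(curve)
    return (curve - curve.min()) / span


# Toy input: one second of silence followed by one second of a 440 Hz tone.
sr = 16000
t = np.arange(sr) / sr
wave = np.concatenate([np.zeros(sr), 0.5 * np.sin(2 * np.pi * 440 * t)])

energy = normalize(frame_energy(wave))
```

On the toy input, the curve stays at zero during the silence and rises once the tone starts. A time-varying curve of this kind is the sort of signal that could describe when and how loudly sound occurs, while the text prompt describes what the sound is.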

Code & Models

Repositories

naver-ai/rewas (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Natural Language Processing Techniques · Advanced Text Analysis Techniques

Methods: Diffusion