Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
Junwon Lee, Juhan Nam, Jiyoung Lee

TL;DR
This paper introduces SELVA, a text-conditioned model for selective video-to-audio (V2A) generation that produces only the user-intended sound from a multi-object video, supporting per-source editing, mixing, and creative control in multimedia production.
Contribution
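The paper treats the text prompt as an explicit selector of prompt-relevant sound-source visual features, adds supplementary tokens that suppress text-irrelevant activations during efficient video encoder finetuning, and trains with a self-supervised video-mixing scheme that compensates for the lack of mono audio track supervision.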
Findings
SELVA outperforms baselines in audio quality, semantic alignment, and synchronization.
Supplementary tokens effectively suppress text-irrelevant sounds.
Self-supervised training enables high-quality audio generation without mono audio track supervision.
Abstract
This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. We propose SELVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector to distinctly extract prompt-relevant sound-source visual features from the video encoder. To suppress text-irrelevant activations with efficient video encoder finetuning, the proposed supplementary tokens promote cross-attention to yield robust semantic and temporal grounding. SELVA further employs an autonomous video-mixing scheme in a self-supervised manner to overcome the lack of mono audio track supervision. We evaluate SELVA on…
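The self-supervised video-mixing idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how videos are mixed, so everything below is an assumption for illustration only: two single-source clips are placed side by side, one is chosen at random as the target, and its caption serves as the text prompt. The function name `make_selective_example` and the dictionary keys are hypothetical and not taken from the paper.

```python
import numpy as np


def make_selective_example(clip_a, clip_b, rng):
    """Build one training example from two clips (illustrative assumption only).

    Each clip is a dict with hypothetical keys:
      "frames": uint8 array of shape (T, H, W, 3)
      "audio":  float array of shape (N,)
      "text":   caption describing the clip's sound source
    """
    fa, fb = clip_a["frames"], clip_b["frames"]
    if fa.shape[0] != fb.shape[0] or fa.shape[1] != fb.shape[1]:
        raise ValueError("clips must share frame count and height")

    # Assumed mixing rule: concatenate along the width axis so that both
    # sound sources are visible in every frame of the mixed video.
    mixed_frames = np.concatenate([fa, fb], axis=2)

    # Pick which source the prompt should select. Only that clip's audio is
    # the target, so no separately recorded single-source track of the mixed
    # scene is needed.
    target = clip_a if rng.random() < 0.5 else clip_b

    return {
        "video": mixed_frames,
        "prompt": target["text"],
        "target_audio": target["audio"],
    }
```

Under these assumptions, the model sees a multi-object video but is supervised only with the audio of the prompted source, which is the selective behavior the abstract describes. The actual SELVA pipeline may mix videos differently.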
