Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

Junwon Lee; Juhan Nam; Jiyoung Lee

arXiv:2512.02650·cs.CV·March 30, 2026

TL;DR

This paper introduces SELVA, a text-conditioned model for selective video-to-audio (V2A) generation that produces only the user-intended sound from a multi-object video, supporting per-source editing, mixing, and creative control in multimedia production.

Contribution

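The paper proposes three components: a text prompt used as an explicit selector of prompt-relevant sound-source visual features, supplementary tokens that suppress text-irrelevant activations during efficient video encoder finetuning, and a self-supervised video-mixing scheme that compensates for the lack of mono audio track supervision.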

Findings

01 SELVA outperforms baselines in audio quality, semantic alignment, and synchronization.

02 Supplementary tokens effectively suppress irrelevant sounds.

03 Self-supervised training enables high-quality audio extraction without mono audio supervision.

Abstract

This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. We propose SELVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector to distinctly extract prompt-relevant sound-source visual features from the video encoder. To suppress text-irrelevant activations with efficient video encoder finetuning, the proposed supplementary tokens promote cross-attention to yield robust semantic and temporal grounding. SELVA further employs an autonomous video-mixing scheme in a self-supervised manner to overcome the lack of mono audio track supervision. We evaluate SELVA on…
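The abstract does not spell out how the video-mixing scheme builds its training data, so the following is only a minimal sketch of one plausible reading: two single-source clips are combined into one multi-object video, the text of one clip serves as the prompt, and that clip's own audio is the target. The names `Clip`, `TrainingExample`, `mix_side_by_side`, and `make_example` are hypothetical, and the side-by-side mixing rule is an assumption for illustration, not the authors' method.

```python
from dataclasses import dataclass
from typing import List

Frame = List[List[float]]  # one grayscale frame as rows of pixel values


@dataclass
class Clip:
    """A single-source clip: its frames, its audio samples, and a text label."""
    frames: List[Frame]
    audio: List[float]
    text: str


@dataclass
class TrainingExample:
    """Model inputs (mixed video, prompt) and the target (one source's audio)."""
    mixed_frames: List[Frame]
    prompt: str
    target_audio: List[float]


def mix_side_by_side(a: Clip, b: Clip) -> List[Frame]:
    """Assumed mixing rule: concatenate the two clips horizontally, frame by frame."""
    if len(a.frames) != len(b.frames):
        raise ValueError("clips must have the same number of frames")
    mixed = []
    for frame_a, frame_b in zip(a.frames, b.frames):
        if len(frame_a) != len(frame_b):
            raise ValueError("frames must have the same height")
        # Join each pair of rows so both sources appear in one frame.
        mixed.append([row_a + row_b for row_a, row_b in zip(frame_a, frame_b)])
    return mixed


def make_example(selected: Clip, distractor: Clip) -> TrainingExample:
    """Build one self-supervised example.

    The video contains both sources, but the prompt and the target audio come
    only from `selected`, so no separately recorded audio track is required.
    """
    return TrainingExample(
        mixed_frames=mix_side_by_side(selected, distractor),
        prompt=selected.text,
        target_audio=list(selected.audio),
    )
```

Under these assumptions, swapping the roles of the two clips yields a second example from the same pair, with the same mixed content but a different prompt and target.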

Code & Models

Repositories

jnwnlee/selva (GitHub)

Models

jnwnlee/SelVA (model)

Datasets

jnwnlee/vgg-monoaudio (dataset, 1.1k downloads)
