AudioScenic: Audio-Driven Video Scene Editing
Kaixin Shen, Ruijie Quan, Linchao Zhu, Jun Xiao, Yi Yang

TL;DR
AudioScenic is a novel framework that uses audio signals to guide background editing in videos while preserving foreground content, improving temporal consistency and visual diversity.
Contribution
We introduce AudioScenic, a new audio-driven video scene editing framework with modules for semantic injection, background masking, and audio-guided temporal control.
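To make the idea of background masking concrete, here is a minimal sketch of mask-based compositing. It is not the paper's SceneMasker implementation: the function name `composite_with_mask`, the binary foreground mask, and the simple per-pixel blend are all assumptions made for illustration.

```python
import numpy as np

def composite_with_mask(original, edited, fg_mask):
    """Keep foreground pixels from `original`, take the rest from `edited`.

    original, edited: float arrays of shape (H, W, C)
    fg_mask: bool or {0, 1} array of shape (H, W); 1 marks foreground
    (Hypothetical helper; the mask is assumed to be given.)
    """
    # Add a channel axis so the mask broadcasts over the color channels
    mask = fg_mask.astype(np.float32)[..., None]
    # Foreground comes from the original frame, background from the edit
    return mask * original + (1.0 - mask) * edited
```

Applied frame by frame, a blend like this leaves the masked foreground untouched while the background takes on the edited content.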
Findings
Outperforms existing methods on the DAVIS and AudioSet datasets.
Improves temporal consistency, as measured by a newly proposed temporal score metric.
Effectively controls background editing guided by audio signals.
Abstract
Audio-driven visual scene editing endeavors to manipulate the visual background while leaving the foreground content unchanged, according to the given audio signals. Unlike current efforts focusing primarily on image editing, audio-driven video scene editing has not been extensively addressed. In this paper, we introduce AudioScenic, an audio-driven framework designed for video scene editing. AudioScenic integrates audio semantics into the visual scene through a temporal-aware audio semantic injection process. As our focus is on background editing, we further introduce a SceneMasker module, which maintains the integrity of the foreground content during the editing process. AudioScenic exploits the inherent properties of audio, namely, audio magnitude and frequency, to guide the editing process, aiming to control the temporal dynamics and enhance the temporal consistency. First, we…
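As a rough illustration of the two audio properties the abstract names, the sketch below computes a per-video-frame magnitude (RMS) and dominant frequency from a mono waveform. This is an assumption-laden example, not the authors' pipeline: the function name `per_frame_audio_features`, the non-overlapping windowing, and the choice of RMS and peak-FFT-bin as the features are all hypothetical.

```python
import numpy as np

def per_frame_audio_features(waveform, sample_rate, fps):
    """Return (rms, dominant_hz), one value per video frame.

    waveform: 1-D float array (mono audio)
    sample_rate: audio samples per second
    fps: video frames per second
    (Hypothetical helper for illustration only.)
    """
    samples_per_frame = int(sample_rate / fps)
    n_frames = len(waveform) // samples_per_frame
    # Split the audio into non-overlapping windows, one per video frame
    windows = waveform[: n_frames * samples_per_frame]
    windows = windows.reshape(n_frames, samples_per_frame)
    # Magnitude: root-mean-square amplitude of each window
    rms = np.sqrt(np.mean(windows ** 2, axis=1))
    # Frequency: the FFT bin with the largest magnitude in each window
    tapered = windows * np.hanning(samples_per_frame)
    spectrum = np.abs(np.fft.rfft(tapered, axis=1))
    freqs = np.fft.rfftfreq(samples_per_frame, d=1.0 / sample_rate)
    dominant_hz = freqs[np.argmax(spectrum, axis=1)]
    return rms, dominant_hz
```

Per-frame signals of this kind could then be used to modulate how strongly, or how smoothly, each frame is edited; how AudioScenic actually does this is described in the paper itself.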
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Music Technology and Sound Studies
