Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation
Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu

TL;DR
This paper presents a diffusion-based framework for language-guided joint audio-visual editing that combines one-shot adaptation with cross-modal semantic enhancement to produce consistent, contextually edited audio-visual content.
Contribution
It introduces a one-shot adaptation approach for diffusion models and a cross-modal semantic enhancement to improve language-guided audio-visual editing.
Findings
Effective domain transfer from as few as one audio-visual sample
Improved semantic consistency in audio-visual editing
Outperforms baseline methods in experiments
Abstract
In this paper, we introduce a novel task called language-guided joint audio-visual editing. Given an audio and image pair of a sounding event, this task aims at generating new audio-visual content by editing the given sounding event conditioned on the language guidance. For instance, we can alter the background environment of a sounding object while keeping its appearance unchanged, or we can add new sounds contextualized to the visual content. To address this task, we propose a new diffusion-based framework for joint audio-visual editing and introduce two key ideas. Firstly, we propose a one-shot adaptation approach to tailor generative diffusion models for audio-visual content editing. With as few as one audio-visual sample, we jointly transfer the audio and vision diffusion models to the target domain. After fine-tuning, our model enables consistent generation of this audio-visual…
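The one-shot adaptation step described above can be illustrated with a minimal sketch. The excerpt does not state the paper's training objective, so this assumes the standard DDPM noise-prediction loss, a linear noise schedule, and one timestep shared by both modalities. The names `make_alpha_bar`, `add_noise`, `joint_denoising_loss`, and the placeholder denoisers are hypothetical and are not taken from the authors' code.

```python
import math
import random


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear noise schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(x0, eps, a_bar):
    """Forward diffusion: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
            for x, e in zip(x0, eps)]


def joint_denoising_loss(audio_x0, image_x0, audio_denoiser, image_denoiser,
                         alpha_bar, rng):
    """Sum the noise-prediction MSE of both modalities for one audio-image pair."""
    t = rng.randrange(len(alpha_bar))  # shared timestep: an assumption of this sketch
    total = 0.0
    for x0, denoiser in ((audio_x0, audio_denoiser), (image_x0, image_denoiser)):
        eps = [rng.gauss(0.0, 1.0) for _ in x0]   # noise the model should recover
        x_t = add_noise(x0, eps, alpha_bar[t])    # noised version of the single sample
        pred = denoiser(x_t, t)                   # model's noise estimate
        total += sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(x0)
    return total


# Usage with flattened toy "latents" and placeholder denoisers that always
# predict zero noise (hypothetical stand-ins for the two diffusion models).
predict_zero = lambda x_t, t: [0.0] * len(x_t)
loss = joint_denoising_loss([0.1] * 16, [0.5] * 32, predict_zero, predict_zero,
                            make_alpha_bar(), random.Random(0))
```

In an actual fine-tuning loop, this summed loss would be minimized over the parameters of both denoisers using the single audio-visual pair. The optimizer and the choice of which parameters to update are left out because the excerpt does not specify them.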
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization
Methods: Diffusion
