Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation

Susan Liang; Chao Huang; Yapeng Tian; Anurag Kumar; Chenliang Xu

arXiv:2410.07463 · cs.CV · November 12, 2024

Open Access

TL;DR

This paper presents a diffusion-based framework for language-guided joint audio-visual editing. It combines one-shot adaptation with cross-modal semantic enhancement to produce consistent, contextually edited audio-visual content.

Contribution

It introduces a one-shot adaptation approach for diffusion models and a cross-modal semantic enhancement to improve language-guided audio-visual editing.

Findings

01 Effective one-shot domain transfer from as few as one audio-visual sample

02 Improved semantic consistency in audio-visual editing

03 Outperforms baseline methods in experiments

Abstract

In this paper, we introduce a novel task called language-guided joint audio-visual editing. Given an audio and image pair of a sounding event, this task aims at generating new audio-visual content by editing the given sounding event conditioned on the language guidance. For instance, we can alter the background environment of a sounding object while keeping its appearance unchanged, or we can add new sounds contextualized to the visual content. To address this task, we propose a new diffusion-based framework for joint audio-visual editing and introduce two key ideas. Firstly, we propose a one-shot adaptation approach to tailor generative diffusion models for audio-visual content editing. With as few as one audio-visual sample, we jointly transfer the audio and vision diffusion models to the target domain. After fine-tuning, our model enables consistent generation of this audio-visual…

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization

Methods: Diffusion