AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

Kai Wang; Shijian Deng; Jing Shi; Dimitrios Hatzinakos; Yapeng Tian

arXiv:2406.07686 · cs.CV · June 13, 2024 · 1 citation

TL;DR

AV-DiT introduces an efficient audio-visual diffusion transformer that leverages a shared backbone with lightweight adapters, enabling high-quality joint audio-visual content generation with reduced complexity and parameters.

Contribution

The paper presents a novel shared backbone diffusion transformer with modality-specific adapters for efficient joint audio-visual generation, achieving state-of-the-art results with fewer parameters.

Findings

01. State-of-the-art performance on AIST++ and Landscape datasets.
02. Significantly fewer tunable parameters compared to existing methods.
03. A shared image generative backbone suffices for joint audio-visual generation.

Abstract

Recent Diffusion Transformers (DiTs) have shown impressive capabilities in generating high-quality single-modality content, including images, videos, and audio. However, it is still under-explored whether the transformer-based diffuser can efficiently denoise the Gaussian noises towards superb multimodal content creation. To bridge this gap, we introduce AV-DiT, a novel and efficient audio-visual diffusion transformer designed to generate high-quality, realistic videos with both visual and audio tracks. To minimize model complexity and computational costs, AV-DiT utilizes a shared DiT backbone pre-trained on image-only data, with only lightweight, newly inserted adapters being trainable. This shared backbone facilitates both audio and video generation. Specifically, the video branch incorporates a trainable temporal attention layer into a frozen pre-trained DiT block for temporal…
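
The video branch described in the abstract can be pictured with a small, dependency-free sketch. It is only an illustration under stated assumptions, not the authors' implementation: `frozen_spatial_block` is a hypothetical stand-in for the frozen image-pretrained DiT block, the temporal attention uses identity query/key/value projections instead of learned ones, its placement after the spatial block is assumed, and the scalar `gate` is an invented device for showing that the added layer is a residual update on top of the frozen output.

```python
import math


def frozen_spatial_block(frame):
    """Hypothetical stand-in for a frozen, image-pretrained DiT block.

    It processes one frame (a list of token vectors) with fixed weights.
    A real block would contain self-attention and an MLP.
    """
    return [[0.5 * x + 0.1 for x in token] for token in frame]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(video, gate):
    """Residual attention across frames at each token position.

    video: T frames, each a list of N tokens, each a list of D floats.
    gate:  assumed scalar on the residual update; 0.0 leaves the input unchanged.
    Query, key, and value projections are the identity here for brevity;
    in a trainable layer they would be learned matrices.
    """
    num_frames = len(video)
    num_tokens = len(video[0])
    dim = len(video[0][0])
    out = [[list(token) for token in frame] for frame in video]
    for n in range(num_tokens):
        # The same token position gathered from every frame.
        seq = [video[t][n] for t in range(num_frames)]
        for t in range(num_frames):
            scores = [
                sum(a * b for a, b in zip(seq[t], other)) / math.sqrt(dim)
                for other in seq
            ]
            weights = softmax(scores)
            for d in range(dim):
                mixed = sum(w * v[d] for w, v in zip(weights, seq))
                out[t][n][d] = video[t][n][d] + gate * mixed
    return out


def video_block(video, gate):
    """Frozen per-frame processing followed by the added temporal layer."""
    spatial = [frozen_spatial_block(frame) for frame in video]
    return temporal_attention(spatial, gate)


# Toy input: 2 frames, 3 tokens per frame, 4 channels per token.
video = [[[float(t + n + d) for d in range(4)] for n in range(3)] for t in range(2)]
out = video_block(video, gate=0.0)
```

With `gate=0.0` the block returns exactly what the frozen per-frame block produces, which is one way to see how a pretrained image backbone could be kept intact while a new temporal layer supplies cross-frame information once its weights are trained.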

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Adaptive Filtering Techniques

Methods: Diffusion