ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation

Jiahui Sun; Weining Wang; Mingzhen Sun; Yirong Yang; Xinxin Zhu; Jing Liu

arXiv:2511.12072 · cs.MM · November 18, 2025

TL;DR

ProAV-DiT is a novel model that efficiently generates synchronized audio-video content by aligning modalities in a unified latent space and employing a multi-scale attention mechanism, achieving high quality with lower computational costs.

Contribution

The paper introduces ProAV-DiT, a new framework combining latent diffusion and transformer architectures for synchronized audio-video generation with improved efficiency and quality.

Findings

1. Outperforms existing methods in generation quality
2. Reduces computational overhead significantly
3. Achieves high-fidelity synchronized audio-video content

Abstract

Sounding Video Generation (SVG) remains a challenging task due to the inherent structural misalignment between audio and video, as well as the high computational cost of multimodal data processing. In this paper, we introduce ProAV-DiT, a Projected Latent Diffusion Transformer designed for efficient and synchronized audio-video generation. To address structural inconsistencies, we preprocess raw audio into video-like representations, aligning both the temporal and spatial dimensions between audio and video. At its core, ProAV-DiT adopts a Multi-scale Dual-stream Spatio-Temporal Autoencoder (MDSA), which projects both modalities into a unified latent space using orthogonal decomposition, enabling fine-grained spatiotemporal modeling and semantic alignment. To further enhance temporal coherence and modality-specific fusion, we introduce a multi-scale attention mechanism, which consists of…
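The abstract says raw audio is preprocessed into video-like representations so that its temporal and spatial dimensions line up with the video, but the portion quoted here does not say how. As a rough illustration only, the sketch below assumes one plausible scheme: compute a log-magnitude spectrogram and cut its time axis into one 2-D "image" per video frame. The function name `audio_to_frames`, the STFT parameters, and the chunking rule are all hypothetical and are not taken from the paper.

```python
import numpy as np


def audio_to_frames(wave, num_frames, n_fft=256, hop=64):
    """Turn a 1-D waveform into an array of shape
    (num_frames, freq_bins, steps_per_frame).

    Illustrative assumption, not the paper's preprocessing.
    """
    # Short-time Fourier transform with a Hann window.
    window = np.hanning(n_fft)
    starts = range(0, len(wave) - n_fft + 1, hop)
    spec = np.stack(
        [np.abs(np.fft.rfft(wave[s:s + n_fft] * window)) for s in starts],
        axis=1,
    )  # (freq_bins, time_steps)

    # Log compression keeps the dynamic range manageable.
    log_spec = np.log1p(spec)

    # Drop trailing time steps so the time axis splits evenly.
    steps_per_frame = log_spec.shape[1] // num_frames
    if steps_per_frame == 0:
        raise ValueError("waveform too short for the requested frame count")
    log_spec = log_spec[:, : steps_per_frame * num_frames]

    # One spectrogram chunk per video frame: (frames, freq, time-in-frame).
    frames = log_spec.reshape(log_spec.shape[0], num_frames, steps_per_frame)
    return frames.transpose(1, 0, 2)


# One second of a 440 Hz tone at 16 kHz, paired with an 8-frame clip.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 440.0 * t)
frames = audio_to_frames(wave, num_frames=8)
```

With these settings `frames` has shape `(8, 129, 30)`: eight audio "frames", each a 129 x 30 grid that could be resized to the video's spatial resolution. The point of the layout is that audio and video then share a (time, height, width) structure, which is what makes a shared autoencoder over both modalities conceivable.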

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Music Technology and Sound Studies