MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation

Trung X. Pham; Tri Ton; Chang D. Yoo

arXiv:2410.02130·cs.SD·February 14, 2025

Open Access

TL;DR

MDSGen is a framework for vision-guided open-domain sound generation built on masked diffusion transformers. Compared with existing models, it needs far fewer parameters and less memory and runs inference faster, while keeping high audio-video alignment accuracy.

Contribution

Introduces MDSGen, a resource-efficient masked diffusion transformer framework for sound generation that combines a redundant video feature removal module with a temporal-aware masking strategy.

Findings

01

The smallest model (5M parameters) achieves 97.9% alignment accuracy on the VGGSound benchmark.

02

Uses 172× fewer parameters and 371% less memory than the current 860M-parameter state-of-the-art model.

03

Offers 36× faster inference than the same state-of-the-art model.
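As a quick consistency check on the parameter figures, the abstract's 860M-parameter baseline divided by the reported 172× reduction should land on the 5M-parameter model. The short calculation below only restates numbers given on this page; the variable names are illustrative.

```python
# Figures taken from the abstract; variable names are illustrative only.
baseline_params = 860_000_000   # current state-of-the-art model (860M)
reduction_factor = 172          # "172x fewer parameters"

# Parameter count implied for the smallest MDSGen model.
smallest_params = baseline_params / reduction_factor

print(f"Implied smallest model size: {smallest_params / 1e6:.1f}M parameters")
```

The result agrees with the 5M-parameter size quoted for the smallest model.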

Abstract

We introduce MDSGen, a novel framework for vision-guided open-domain sound generation optimized for model parameter size, memory consumption, and inference speed. This framework incorporates two key innovations: (1) a redundant video feature removal module that filters out unnecessary visual information, and (2) a temporal-aware masking strategy that leverages temporal context for enhanced audio generation accuracy. In contrast to existing resource-heavy U-Net-based models, MDSGen employs denoising masked diffusion transformers, facilitating efficient generation without reliance on pre-trained diffusion models. Evaluated on the benchmark VGGSound dataset, our smallest model (5M parameters) achieves 97.9% alignment accuracy, using 172× fewer parameters, 371% less memory, and offering 36× faster inference than the current 860M-parameter state-of-the-art model…
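The abstract does not say how the temporal-aware masking is implemented. The sketch below is one plausible reading, assuming the audio latent is a grid of tokens indexed by time step and frequency bin, and that "temporal-aware" means hiding whole time steps rather than scattered individual tokens. The function names, grid shape, and mask ratio are hypothetical and are not taken from the paper.

```python
import random


def temporal_mask(num_time_steps, num_freq_bins, mask_ratio, rng):
    """Return a time x frequency grid of booleans (True = masked).

    ASSUMPTION: whole time steps are masked together, so every frequency
    bin at a selected time index is hidden at once.
    """
    num_masked = round(mask_ratio * num_time_steps)
    masked_steps = set(rng.sample(range(num_time_steps), num_masked))
    return [
        [t in masked_steps for _ in range(num_freq_bins)]
        for t in range(num_time_steps)
    ]


def random_mask(num_time_steps, num_freq_bins, mask_ratio, rng):
    """Baseline for contrast: mask individual tokens with no time structure."""
    total = num_time_steps * num_freq_bins
    num_masked = round(mask_ratio * total)
    masked = set(rng.sample(range(total), num_masked))
    return [
        [(t * num_freq_bins + f) in masked for f in range(num_freq_bins)]
        for t in range(num_time_steps)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the printed pattern is repeatable
    grid = temporal_mask(num_time_steps=8, num_freq_bins=4, mask_ratio=0.5, rng=rng)
    for row in grid:
        # '#' marks a masked token, '.' a visible one; each row is one time step.
        print("".join("#" if masked else "." for masked in row))
```

Under this assumption, a masked time step leaves no local audio tokens to copy from, so a model trained to reconstruct it would have to rely on neighbouring time steps and on the conditioning signal. Whether MDSGen masks in exactly this way is not stated here.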

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies

Methods: Diffusion