LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis

Zhenxiong Tan; Xinyin Ma; Gongfan Fang; Xinchao Wang

arXiv:2407.10468 · cs.SD · July 16, 2024

TL;DR

LiteFocus accelerates long audio synthesis in latent diffusion models by sparsifying self-attention, cutting inference time nearly in half (a 1.99x speedup on 80-second clips) while improving audio quality for extended clips.

Contribution

The paper introduces LiteFocus, a dual sparse attention mechanism that improves both inference speed and audio quality when existing audio latent diffusion models synthesize long audio.

Findings

01. Inference time reduced by 1.99x for 80-second clips
02. Improved audio quality in long audio synthesis
03. Effective handling of long audio sequences with sparse attention

Abstract

Latent diffusion models have shown promising results in audio generation, making notable advances over traditional methods. However, their performance, while impressive on short audio clips, degrades when extended to longer audio sequences. These challenges stem from the model's self-attention mechanism and from training predominantly on 10-second clips, which complicates the extension to longer audio without adaptation. In response to these issues, we introduce a novel approach, LiteFocus, that enhances the inference of existing audio latent diffusion models in long audio synthesis. Based on the attention patterns observed in self-attention, we employ a dual sparse form of attention computation, consisting of same-frequency focus and cross-frequency compensation, which curtails the attention computation under same-frequency constraints while enhancing audio quality through cross-frequency…
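
To make the idea of a dual sparse attention pattern more concrete, the following is a minimal sketch of an attention mask over a time-by-frequency grid of latent tokens. It is not the authors' implementation. The function names `dual_sparse_mask` and `density`, the row-major token layout, and the `cross_window` parameter are all assumptions made for illustration. In particular, the abstract is truncated before it describes how cross-frequency compensation selects tokens, so the local time window used here is a hypothetical stand-in.

```python
def dual_sparse_mask(num_frames, num_bins, cross_window=None):
    """Return an n x n boolean mask (n = num_frames * num_bins).

    mask[q][k] is True when query token q may attend to key token k.
    Assumption: tokens are flattened row-major, index = frame * num_bins + bin.
    """
    n = num_frames * num_bins
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        q_frame, q_bin = divmod(q, num_bins)
        for k in range(n):
            k_frame, k_bin = divmod(k, num_bins)
            if k_bin == q_bin:
                # Same-frequency focus: keep every time step of the same bin.
                mask[q][k] = True
            elif cross_window is not None and abs(k_frame - q_frame) <= cross_window:
                # Hypothetical cross-frequency compensation: other bins,
                # but only within a small window of neighbouring frames.
                mask[q][k] = True
    return mask


def density(mask):
    """Fraction of query-key pairs that the mask keeps."""
    kept = sum(sum(row) for row in mask)
    return kept / (len(mask) * len(mask[0]))


same_only = dual_sparse_mask(num_frames=8, num_bins=4)
with_cross = dual_sparse_mask(num_frames=8, num_bins=4, cross_window=0)
print(density(same_only))   # 0.25
print(density(with_cross))  # 0.34375
```

In this toy layout the same-frequency mask keeps 1/num_bins of all query-key pairs, which is where the saving in attention computation would come from; the compensation term adds back a small number of cross-frequency pairs.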

Code & Models

Repositories

yuanshi9815/litefocus
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need · Focus · Diffusion