LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis
Zhenxiong Tan, Xinyin Ma, Gongfan Fang, Xinchao Wang

TL;DR
LiteFocus accelerates long audio synthesis in latent diffusion models by sparsifying self-attention, cutting inference time nearly in half (a 1.99x speedup on 80-second clips) while improving audio quality for extended clips.
Contribution
The paper introduces LiteFocus, a dual sparse attention mechanism that enhances inference speed and quality in long audio synthesis with latent diffusion models.
Findings
Inference time reduced by 1.99x for 80-second clips
Improved audio quality in long audio synthesis
Effective handling of long audio sequences with sparse attention
Abstract
Latent diffusion models have shown promising results in audio generation, making notable advancements over traditional methods. However, their performance, while impressive with short audio clips, faces challenges when extended to longer audio sequences. These challenges stem from the model's self-attention mechanism and from training predominantly on 10-second clips, which complicates the extension to longer audio without adaptation. In response to these issues, we introduce a novel approach, LiteFocus, that enhances the inference of existing audio latent diffusion models in long audio synthesis. Having observed the attention pattern in self-attention, we employ a dual sparse form for attention calculation, designated as same-frequency focus and cross-frequency compensation, which curtails the attention computation under same-frequency constraints, while enhancing audio quality through cross-frequency…
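The dual sparse form named in the abstract can be pictured as two attention masks over a time-frequency grid of latent tokens. The sketch below is only an illustration under stated assumptions, not the paper's implementation: the row-major token layout, the function names, and especially the strided key selection used for cross-frequency compensation are hypothetical choices made here, since the abstract does not specify them. It also applies the masks to dense attention, so it shows the sparsity pattern but does not itself save computation.

```python
import numpy as np

def same_frequency_mask(n_time, n_freq):
    # Assumed layout: tokens flattened row-major, index = t * n_freq + f.
    freq = np.arange(n_time * n_freq) % n_freq
    # A query may attend only to keys that share its frequency index.
    return freq[:, None] == freq[None, :]

def cross_frequency_mask(n_time, n_freq, stride):
    # Hypothetical compensation rule (an assumption, not from the paper):
    # every query may also attend to all keys at every `stride`-th time step,
    # regardless of frequency.
    n = n_time * n_freq
    time = np.arange(n) // n_freq
    keep = (time % stride) == 0
    return np.broadcast_to(keep[None, :], (n, n))

def masked_attention(q, k, v, mask):
    # Standard scaled dot-product attention with disallowed pairs set to -inf.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    n_time, n_freq, dim = 4, 3, 8
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((n_time * n_freq, dim)) for _ in range(3))
    mask = same_frequency_mask(n_time, n_freq) | cross_frequency_mask(n_time, n_freq, stride=2)
    out = masked_attention(q, k, v, mask)
    print(out.shape, mask.mean())  # output shape and fraction of pairs kept
```

In this toy setup the same-frequency mask alone keeps one in `n_freq` of all query-key pairs, which is where a real sparse kernel would recover its savings.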
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Softmax · Attention Is All You Need · Focus · Diffusion
