Efficient Autoregressive Audio Modeling via Next-Scale Prediction

Kai Qiu; Xiang Li; Hao Chen; Jie Sun; Jinglu Wang; Zhe Lin; Marios Savvides; Bhiksha Raj

arXiv:2408.09027·cs.SD·December 18, 2024

TL;DR

This paper introduces a novel scale-level audio tokenizer and an autoregressive modeling framework that significantly improves the efficiency of audio generation, achieving 35 times faster inference and better quality on the AudioSet benchmark.

Contribution

It proposes a new scale-level audio tokenizer and a next-scale prediction framework that reduces training and inference costs for autoregressive audio models.

Findings

01. Achieves 35× faster inference than the baselines.

02. Improves Fréchet Audio Distance (FAD) by 1.33 over the baselines on AudioSet (lower FAD is better).

03. Validates the proposed tokenizer and modeling framework through comprehensive analysis.

Abstract

Audio generation has achieved remarkable progress with the advance of sophisticated generative models, such as diffusion models (DMs) and autoregressive (AR) models. However, due to the naturally significant sequence length of audio, the efficiency of audio generation remains an essential issue to be addressed, especially for AR models that are incorporated in large language models (LLMs). In this paper, we analyze the token length of audio tokenization and propose a novel Scale-level Audio Tokenizer (SAT), with improved residual quantization. Based on SAT, a scale-level Acoustic AutoRegressive (AAR) modeling framework is further proposed, which shifts the next-token AR prediction to next-scale AR prediction, significantly reducing the training cost and inference time. To validate the effectiveness of the proposed approach, we…
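The idea of scale-level tokenization with residual quantization can be illustrated with a toy sketch. This is not the paper's SAT or AAR implementation: the 1-D latent, the scale schedule `scale_lengths`, the helper names, and the use of scalar rounding in place of a learned codebook are all assumptions made for illustration. It only shows how each coarse-to-fine scale quantizes what the previous scales left unexplained, and why a model that emits one whole scale per step needs fewer sequential steps than one that emits one token per step.

```python
# Toy multi-scale residual quantization (illustrative only, not the paper's SAT).

def downsample(x, n):
    """Average-pool list x down to n entries (len(x) must be divisible by n)."""
    k = len(x) // n
    return [sum(x[i * k:(i + 1) * k]) / k for i in range(n)]


def upsample(x, n):
    """Nearest-neighbour upsample list x to n entries (n divisible by len(x))."""
    k = n // len(x)
    return [v for v in x for _ in range(k)]


def quantize(x, step):
    """Scalar rounding to a grid; a stand-in for a learned codebook lookup."""
    return [round(v / step) * step for v in x]


def multi_scale_residual_quantize(latent, scale_lengths, step=0.25):
    """Quantize the remaining residual at each scale, coarse to fine."""
    residual = list(latent)
    recon = [0.0] * len(latent)
    tokens = []
    for n in scale_lengths:
        q = quantize(downsample(residual, n), step)
        tokens.append(q)
        up = upsample(q, len(latent))
        residual = [r - u for r, u in zip(residual, up)]
        recon = [c + u for c, u in zip(recon, up)]
    return tokens, recon


# Hypothetical latent sequence and scale schedule (token count per scale).
latent = [0.1, 0.9, 0.4, -0.3, 1.2, 0.7, -0.8, 0.05]
scale_lengths = [1, 2, 4, 8]

tokens, recon = multi_scale_residual_quantize(latent, scale_lengths)
max_error = max(abs(a - b) for a, b in zip(latent, recon))

# Sequential decoding steps: one per token versus one per scale.
next_token_steps = len(latent)
next_scale_steps = len(scale_lengths)

print("tokens per scale:", [len(t) for t in tokens])
print("max reconstruction error:", max_error)
print("next-token steps:", next_token_steps, "next-scale steps:", next_scale_steps)
```

In this sketch the total number of tokens across scales is larger than the length of the finest scale, but the number of sequential steps grows with the number of scales rather than with the sequence length, which is the kind of saving the abstract attributes to next-scale prediction.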

Code & Models

Repositories

qiuk2/aar (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings