Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds, Guangzhi Sun, William Held, Diyi Yang

TL;DR
This paper systematically studies and scales open discrete audio foundation models that jointly model semantic, acoustic, and text tokens, enabling versatile audio and cross-modal applications.
Contribution
It introduces a validated training recipe, scaling law insights, and the SODA model suite for open discrete audio modeling with joint semantic, acoustic, and text tokens.
Findings
As compute increases, the optimal amount of training data grows 1.6 times faster than the optimal model size.
Scaling laws are validated across 64 models from 135M to 4B parameters.
SODA models outperform existing models and support diverse audio/text tasks.
Abstract
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices (data sources, text mixture ratios, and token composition), establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning … to … FLOPs, finding that optimal data grows 1.6× faster than optimal…
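To make the "1.6× faster" finding concrete, here is a minimal arithmetic sketch. It reads the claim as a ratio of power-law exponents, with optimal model size N_opt ∝ C^a and optimal data D_opt ∝ C^b for compute C, and b/a = 1.6. It further assumes the common approximation C ≈ 6·N·D, which forces a + b = 1. Both the interpretation and the 6·N·D rule are assumptions made for illustration; the exponents the paper actually fits may differ, and the function names are hypothetical.

```python
def split_exponents(ratio: float) -> tuple[float, float]:
    """Return (a, b) such that N_opt ∝ C**a and D_opt ∝ C**b.

    Assumptions (not taken from the paper): b / a == ratio, and
    a + b == 1, which follows from approximating compute as C ≈ 6 * N * D.
    """
    a = 1.0 / (1.0 + ratio)
    b = ratio / (1.0 + ratio)
    return a, b


def scale_factors(compute_multiplier: float, ratio: float = 1.6) -> tuple[float, float]:
    """How much to grow model size and data when compute grows by a given factor."""
    a, b = split_exponents(ratio)
    return compute_multiplier ** a, compute_multiplier ** b


a, b = split_exponents(1.6)
n_x, d_x = scale_factors(10.0)
print(f"model exponent a = {a:.3f}, data exponent b = {b:.3f}")
print(f"10x compute -> model x{n_x:.2f}, data x{d_x:.2f}")
```

Under these assumptions, a tenfold increase in compute would be spent on roughly 2.4× more parameters and roughly 4.1× more training tokens, so the token budget grows noticeably faster than the model.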
Taxonomy
Topics: Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing
