Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens

Potsawee Manakul; Woody Haosheng Gan; Martijn Bartelds; Guangzhi Sun; William Held; Diyi Yang

arXiv:2602.16687·cs.SD·February 19, 2026

Open Access · 4 Models

TL;DR

This paper systematically studies and scales open discrete audio foundation models that jointly model semantic, acoustic, and text tokens, enabling versatile audio and cross-modal applications.

Contribution

The paper contributes a validated training recipe, scaling-law insights, and the SODA model suite for open discrete audio modeling over joint semantic, acoustic, and text tokens.

Findings

01

As compute increases, the optimal amount of training data grows 1.6 times faster than the optimal model size.

02

Scaling laws are validated across 64 models from 135M to 4B parameters.

03

SODA models outperform existing models and support diverse audio/text tasks.

Abstract

Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices -- data sources, text mixture ratios, and token composition -- establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning $3 \times 10^{18}$ to $3 \times 10^{20}$ FLOPs, finding that optimal data grows $1.6\times$ faster than optimal…
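To make the "1.6 times faster" statement concrete, the following is a minimal sketch under two assumptions that are not confirmed by this page: (a) the figure is read as the ratio of power-law exponents, i.e. optimal model size scales as $C^{a}$ and optimal data as $C^{b}$ with $b = 1.6a$, and (b) training compute follows the common dense-transformer approximation $C \approx 6ND$, which forces $a + b = 1$. The function names are hypothetical and the paper's fitted exponents may differ.

```python
# Illustrative only: derives exponents from the stated 1.6x ratio under the
# ASSUMPTION that compute C ~ 6 * N * D (N = parameters, D = training tokens).
# This is not the paper's fitting procedure.

def optimal_exponents(ratio: float = 1.6) -> tuple[float, float]:
    """Return (a, b) with N_opt ~ C**a, D_opt ~ C**b, b = ratio * a, a + b = 1."""
    a = 1.0 / (1.0 + ratio)
    b = ratio * a
    return a, b


def scale_allocation(compute_multiplier: float, ratio: float = 1.6) -> tuple[float, float]:
    """Return (model_multiplier, data_multiplier) when compute grows by compute_multiplier."""
    a, b = optimal_exponents(ratio)
    return compute_multiplier ** a, compute_multiplier ** b


if __name__ == "__main__":
    a, b = optimal_exponents()
    # The abstract's FLOP range (3e18 to 3e20) spans a 100x increase in compute.
    n_mult, d_mult = scale_allocation(100.0)
    print(f"a = {a:.3f}, b = {b:.3f}")
    print(f"100x compute -> model x{n_mult:.2f}, data x{d_mult:.2f}")
```

Under these assumptions, most of any additional compute goes into training tokens rather than parameters; the product of the two multipliers always equals the compute multiplier.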

Taxonomy

Topics: Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing