TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment

Trung Dang; Sharath Rao; Ananya Gupta; Christopher Gagne; Panagiotis Tzirakis; Alice Baird; Jakub Piotr Cłapa; Peter Chin; Alan Cowen

arXiv:2602.23068·cs.SD·February 27, 2026

Open Access · 6 Models

TL;DR

This paper introduces a speech tokenization method that synchronizes acoustic features one-to-one with text tokens, enabling efficient, high-fidelity speech modeling within large language models while reducing hallucinations and inference cost.

Contribution

The paper proposes a new tokenization scheme for speech that aligns acoustic features with text tokens, allowing unified modeling in LLMs and enhancing speech synthesis and understanding.

Findings

01. Achieves competitive performance with state-of-the-art TTS and SLM systems.

02. Virtually eliminates content hallucinations in speech generation.

03. Reduces inference cost significantly.

Abstract

Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are significantly longer than, and asynchronous with, their corresponding text. Beyond computational inefficiency, this sequence length disparity often triggers hallucinations in TTS and amplifies the modality gap in spoken language modeling (SLM). In this paper, we propose a novel tokenization scheme that establishes one-to-one synchronization between continuous acoustic features and text tokens, enabling unified, single-stream modeling within an LLM. We demonstrate that these synchronous tokens maintain high-fidelity audio reconstruction and can be effectively modeled in a latent space by a large language model with a…
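The length disparity described in the abstract, and the idea of one acoustic vector per text token, can be illustrated with a toy sketch. This is not the paper's method: the 50 Hz frame rate, the token-to-frame spans, and the use of mean pooling are all assumptions made purely for illustration, and `pool_frames_per_token` is a hypothetical helper.

```python
def pool_frames_per_token(frames, token_spans):
    """Average frame-level feature vectors over each token's span [start, end).

    Mean pooling is an illustrative assumption, not the scheme from the paper.
    """
    pooled = []
    for start, end in token_spans:
        segment = frames[start:end]
        if not segment:
            raise ValueError(f"empty frame span for token: ({start}, {end})")
        dim = len(segment[0])
        pooled.append(
            [sum(frame[d] for frame in segment) / len(segment) for d in range(dim)]
        )
    return pooled


# Hypothetical fixed-frame-rate features: 2 s of audio at an assumed 50 frames/s.
frame_rate_hz = 50
duration_s = 2
num_frames = frame_rate_hz * duration_s
frames = [[float(i), 1.0] for i in range(num_frames)]  # toy 2-dim features

# Hypothetical alignment of two text tokens to frame ranges.
tokens = ["hello", "world"]
token_spans = [(0, 40), (40, 100)]

pooled = pool_frames_per_token(frames, token_spans)

# Fixed-rate sequence is 50x longer than the text here; pooled is 1:1 with tokens.
length_ratio = num_frames / len(tokens)
print(len(frames), len(tokens), len(pooled), length_ratio)
```

In this toy setup the frame-level sequence has 100 steps for 2 text tokens, while the pooled sequence has exactly one vector per token, which is the kind of synchronization that lets text and acoustics share a single stream.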

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Generative Adversarial Networks and Image Synthesis