TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment
Trung Dang, Sharath Rao, Ananya Gupta, Christopher Gagne, Panagiotis Tzirakis, Alice Baird, Jakub Piotr Cłapa, Peter Chin, Alan Cowen

TL;DR
This paper introduces a novel speech tokenization method that synchronizes text and acoustic features, enabling efficient, high-fidelity speech modeling within large language models, reducing hallucinations and improving performance.
Contribution
The paper proposes a new tokenization scheme for speech that aligns acoustic features with text tokens, allowing unified modeling in LLMs and enhancing speech synthesis and understanding.
Findings
Achieves competitive performance with state-of-the-art TTS and SLM systems.
Virtually eliminates content hallucinations in speech generation.
Reduces inference cost significantly.
Abstract
Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are significantly longer than, and asynchronous with, their corresponding text. Beyond computational inefficiency, this sequence length disparity often triggers hallucinations in TTS and amplifies the modality gap in spoken language modeling (SLM). In this paper, we propose a novel tokenization scheme that establishes one-to-one synchronization between continuous acoustic features and text tokens, enabling unified, single-stream modeling within an LLM. We demonstrate that these synchronous tokens maintain high-fidelity audio reconstruction and can be effectively modeled in a latent space by a large language model with a…
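The sequence length disparity described in the abstract can be illustrated with a small sketch. It is not the paper's method: the 50 Hz frame rate, the two-second utterance, the six-token transcript, the alignment boundaries, and the mean-pooling step are all assumptions chosen for illustration, and the function names are hypothetical. The sketch only shows how a fixed-frame-rate sequence compares in length with a sequence that carries one acoustic vector per text token.

```python
import math


def fixed_rate_length(duration_s: float, frame_rate_hz: float) -> int:
    """Number of frames a fixed-frame-rate tokenizer emits for an utterance."""
    return math.ceil(duration_s * frame_rate_hz)


def pool_frames_to_tokens(frames, boundaries):
    """Mean-pool frame vectors into one vector per text token.

    `boundaries[i]` is a (start, end) pair of frame indices (end exclusive)
    for text token i. Mean pooling is a stand-in chosen for illustration;
    the paper's actual encoder is not described in this summary.
    """
    pooled = []
    for start, end in boundaries:
        span = frames[start:end]
        if not span:
            raise ValueError(f"empty frame span for boundary ({start}, {end})")
        dim = len(span[0])
        pooled.append([sum(f[d] for f in span) / len(span) for d in range(dim)])
    return pooled


# Assumed values, not taken from the paper.
duration_s = 2.0
frame_rate_hz = 50.0
num_frames = fixed_rate_length(duration_s, frame_rate_hz)

# Toy two-dimensional "acoustic features", one vector per frame.
frames = [[float(i), 1.0] for i in range(num_frames)]

# Hypothetical alignment of six text tokens to frame ranges.
boundaries = [(0, 10), (10, 30), (30, 45), (45, 60), (60, 85), (85, 100)]

token_vectors = pool_frames_to_tokens(frames, boundaries)
print(f"fixed-rate sequence length: {num_frames}")
print(f"text-synchronous sequence length: {len(token_vectors)}")
```

Under these assumed numbers the fixed-rate stream is far longer than the text it corresponds to, while the pooled stream has exactly one vector per text token, which is the property the abstract calls one-to-one synchronization.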
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Generative Adversarial Networks and Image Synthesis
