TTS-1 Technical Report

Oleg Atamanenko; Anna Chalova; Joseph Coombes; Nikki Cope; Phillip Dang; Zhifeng Deng; Jimmy Du; Michael Ermolenko; Feifan Fan; Yufei Feng; Cheryl Fichter; Pavel Filimonov; Louis Fischer; Kylan Gibbs; Valeria Gusarova; Pavel Karpik; Andreas Assad Kottner; Ian Lee; Oliver Louie; Jasmine Mai; Mikhail Mamontov; Suri Mao; Nurullah Morshed; Igor Poletaev; Florin Radu; Dmytro Semernia; Evgenii Shingarev; Vikram Sivaraja; Peter Skirko; Rinat Takhautdinov; Robert Villahermosa; Jean Wang

arXiv:2507.21138·cs.CL·July 30, 2025


TL;DR

This paper introduces Inworld TTS-1, a pair of Transformer-based autoregressive TTS models. TTS-1-Max (8.8B parameters) targets maximum quality and expressiveness, while TTS-1 (1.6B parameters) targets real-time and on-device speech synthesis. Both support 11 languages and report state-of-the-art performance on a variety of benchmarks.

Contribution

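The paper presents two Transformer-based TTS models, TTS-1 and TTS-1-Max, trained by scaling train-time compute and applying a sequential process of pre-training, fine-tuning, and RL-alignment to the speech-language model (SpeechLM) component. Both models achieve state-of-the-art results and support multilingual, expressive speech synthesis.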

Findings

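01 TTS-1-Max, the larger model, has 8.8B parameters and is designed for maximum quality and expressiveness.

02 TTS-1, the more efficient model, has 1.6B parameters and is built for real-time synthesis and on-device use.

03 Both models achieve state-of-the-art performance on a variety of benchmarks, relying purely on in-context learning of the speaker's voice.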
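
To put the two parameter counts in perspective, the following is a rough back-of-the-envelope sketch of how much memory the model weights alone would occupy. Only the parameter counts (8.8B and 1.6B) come from the paper; the numeric precisions, the function name `weight_memory_gib`, and the table layout are assumptions made for illustration, and the estimate ignores activations, the attention cache, and any audio codec component.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Parameter counts are from the paper; precisions are illustrative assumptions.

MODELS = {"TTS-1-Max": 8.8e9, "TTS-1": 1.6e9}

# Hypothetical storage precisions (bytes per parameter); the paper summary
# does not state which precision the released models use.
PRECISIONS = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Return the size of the weights in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


def footprint_table() -> dict:
    """Map each model name to its estimated weight size per precision."""
    return {
        name: {
            precision: weight_memory_gib(params, nbytes)
            for precision, nbytes in PRECISIONS.items()
        }
        for name, params in MODELS.items()
    }


if __name__ == "__main__":
    for name, row in footprint_table().items():
        cells = ", ".join(f"{p}: {gib:.2f} GiB" for p, gib in row.items())
        print(f"{name}: {cells}")
```

Under the assumed 2 bytes per parameter, the weights come to roughly 3 GiB for TTS-1 and roughly 16 GiB for TTS-1-Max, a 5.5x gap that is consistent with positioning the smaller model for on-device use.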

Abstract

We introduce Inworld TTS-1, a set of two Transformer-based autoregressive text-to-speech (TTS) models. Our largest model, TTS-1-Max, has 8.8B parameters and is designed for utmost quality and expressiveness in demanding applications. TTS-1 is our most efficient model, with 1.6B parameters, built for real-time speech synthesis and on-device use cases. By scaling train-time compute and applying a sequential process of pre-training, fine-tuning, and RL-alignment of the speech-language model (SpeechLM) component, both models achieve state-of-the-art performance on a variety of benchmarks, demonstrating exceptional quality relying purely on in-context learning of the speaker's voice. Inworld TTS-1 and TTS-1-Max can generate high-resolution 48 kHz speech with low latency, and support 11 languages with fine-grained emotional control and non-verbal vocalizations through audio markups. We…
