FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning

Tanyu Chen; Tairan Chen; Kai Shen; Zhenghua Bao; Zhihui Zhang; Man Yuan; Yi Shi

arXiv:2601.11141·cs.SD·January 19, 2026



TL;DR

Chroma 1.0 is an open-source, real-time, end-to-end spoken dialogue model that combines sub-second latency with high-fidelity personalized voice cloning, reporting a 10.96% relative improvement in speaker similarity over the human baseline.

Contribution

It introduces the first open-source, end-to-end spoken dialogue model capable of real-time operation and personalized voice synthesis with high speaker similarity.

Findings

01 Achieves a 10.96% relative improvement in speaker similarity over the human baseline

02 Operates at a Real-Time Factor (RTF) of 0.43, enabling low-latency interaction

03 Supports multi-turn conversations with high-quality personalized voices
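To make the two headline metrics concrete, here is a minimal sketch of how they are commonly computed. It assumes the standard definitions (RTF as processing time divided by generated audio duration, and relative improvement as the gain divided by the baseline score); the paper's exact measurement protocol is not given here, and the function names and sample values are hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Standard RTF definition: time spent generating / duration of audio produced.

    Values below 1.0 mean audio is produced faster than it plays back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def relative_improvement(model_score: float, baseline_score: float) -> float:
    """Relative gain of model_score over baseline_score, as a fraction."""
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return (model_score - baseline_score) / baseline_score


# Hypothetical timings chosen to match an RTF of 0.43:
# 4.3 s of compute for 10 s of generated speech.
rtf = real_time_factor(processing_seconds=4.3, audio_seconds=10.0)

# Hypothetical similarity scores, for illustration only (not from the paper).
gain = relative_improvement(model_score=0.55, baseline_score=0.50)
```

Under this definition, an RTF of 0.43 means that each second of speech takes roughly 0.43 seconds to generate, which leaves headroom for streaming playback.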

Abstract

Recent end-to-end spoken dialogue systems leverage speech tokenizers and neural audio codecs to enable LLMs to operate directly on discrete speech representations. However, these models often exhibit limited speaker identity preservation, hindering personalized voice interaction. In this work, we present Chroma 1.0, the first open-source, real-time, end-to-end spoken dialogue model that achieves both low-latency interaction and high-fidelity personalized voice cloning. Chroma achieves sub-second end-to-end latency through an interleaved text-audio token schedule (1:2) that supports streaming generation, while maintaining high-quality personalized voice synthesis across multi-turn conversations. Our experimental results demonstrate that Chroma achieves a 10.96% relative improvement in speaker similarity over the human baseline, with a Real-Time Factor (RTF) of 0.43, while maintaining…
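The interleaved 1:2 text-audio token schedule mentioned in the abstract can be pictured with a small sketch. It assumes that 1:2 means one text token followed by two audio tokens in each group; the function `interleave`, its parameters, and the token labels are hypothetical and are not taken from the paper or the released model code.

```python
from typing import List, Sequence, Tuple


def interleave(
    text_tokens: Sequence[str],
    audio_tokens: Sequence[str],
    text_per_group: int = 1,
    audio_per_group: int = 2,
) -> List[Tuple[str, str]]:
    """Merge two token streams into one sequence using a fixed ratio.

    Each group emits `text_per_group` text tokens, then `audio_per_group`
    audio tokens. When one stream runs out, the other continues alone.
    """
    merged: List[Tuple[str, str]] = []
    t = a = 0
    while t < len(text_tokens) or a < len(audio_tokens):
        merged.extend(("text", tok) for tok in text_tokens[t:t + text_per_group])
        t += text_per_group
        merged.extend(("audio", tok) for tok in audio_tokens[a:a + audio_per_group])
        a += audio_per_group
    return merged


schedule = interleave(["t0", "t1"], ["a0", "a1", "a2", "a3"])
for kind, token in schedule:
    print(kind, token)
```

Because audio tokens appear after the first text token rather than after the full text response, a decoder following such a schedule could begin emitting speech early, which is consistent with the streaming generation the abstract describes.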


Code & Models

Models

FlashLabs/Chroma-4B (Hugging Face model; 911 downloads, 342 likes)


Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Speech and dialogue systems