MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model
Matthew Baas, Pieter Scholtz, Arnav Mehta, Elliott Dyson, Akshat Prakash, Herman Kamper

TL;DR
MARS6 is a compact, hierarchical transformer-based TTS model that delivers expressive speech synthesis and zero-shot voice cloning at a quality comparable to much larger models, with improved efficiency and stability.
Contribution
The paper introduces MARS6, a small encoder-decoder transformer for TTS with a hierarchical decoder, which improves expressiveness, stability, and efficiency by combining recent modelling and training techniques.
Findings
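Achieves TTS quality and reference speaker cloning comparable to models many times larger, using only 70M parameters.
Reduces repetitive generation and improves output stability by combining several recent training and inference techniques.
Processes new speech tokens at only 12 Hz, enabling efficient modelling of long-form text while retaining reconstruction quality.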
Abstract
Codec-based text-to-speech (TTS) models have shown impressive quality with zero-shot voice cloning abilities. However, they often struggle with more expressive references or complex text inputs. We present MARS6, a robust encoder-decoder transformer for rapid, expressive TTS. MARS6 is built on recent improvements in spoken language modelling. Utilizing a hierarchical setup for its decoder, new speech tokens are processed at a rate of only 12 Hz, enabling efficient modelling of long-form text while retaining reconstruction quality. We combine several recent training and inference techniques to reduce repetitive generation and improve output stability and quality. This enables the 70M-parameter MARS6 to achieve similar performance to models many times larger. We show this in objective and subjective evaluations, comparing TTS output quality and reference speaker cloning ability. Project…
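To make the 12 Hz figure concrete, the following is a minimal sketch of the arithmetic behind the efficiency claim: the number of new speech-token steps a decoder must produce grows linearly with the token rate. The function name `decoder_steps` and the 75 Hz comparison rate are illustrative assumptions, not values taken from the paper; only the 12 Hz rate comes from the abstract.

```python
import math


def decoder_steps(duration_s: float, token_rate_hz: float) -> int:
    """Number of token steps needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * token_rate_hz)


MARS6_RATE_HZ = 12.0       # token rate stated in the abstract
COMPARISON_RATE_HZ = 75.0  # assumption: an illustrative higher-rate codec, not from the source

for seconds in (5, 30, 60):
    low = decoder_steps(seconds, MARS6_RATE_HZ)
    high = decoder_steps(seconds, COMPARISON_RATE_HZ)
    print(f"{seconds:>3d} s of audio: {low:>4d} steps at 12 Hz, {high:>4d} steps at 75 Hz")
```

Under these assumptions, a lower token rate shortens the sequence the model has to attend over, which is one plausible reason long-form text becomes cheaper to model.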
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
