HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis

Yuto Nishimura, Takumi Hirose, Masanari Ohi, Hideki Nakayama, and Nakamasa Inoue

arXiv:2410.04380 · eess.AS · October 11, 2024

Open Access

TL;DR

HALL-E introduces hierarchical token prediction and multi-resolution quantization to enable stable, minute-long zero-shot text-to-speech synthesis using neural audio codecs and large language models.

Contribution

The paper proposes a novel hierarchical token prediction model and a multi-resolution quantization framework to improve long-form speech synthesis with neural audio codecs.

Findings

01. Achieved stable minute-long speech synthesis in a single inference step.

02. Reduced audio token frame rate to as low as 8 Hz.

03. Demonstrated effectiveness on the VALL-E based framework.

Abstract

Text-to-speech (TTS) models based on large language models (LLMs), which translate natural language text into sequences of discrete audio tokens, have recently gained considerable research attention, driven by advances in neural audio codec (NAC) models that use residual vector quantization (RVQ). However, long-form speech synthesis remains a significant challenge: the high frame rate increases the length of the audio token sequence, making it difficult for autoregressive language models to generate audio tokens for even a minute of speech. To address this challenge, this paper introduces two novel post-training approaches: 1) Multi-Resolution Requantization (MReQ) and 2) HALL-E. MReQ is a framework for reducing the frame rate of pre-trained NAC models. Specifically, it incorporates a multi-resolution residual vector quantization (MRVQ) module that hierarchically reorganizes discrete audio tokens through…
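To make the frame-rate argument concrete, the short Python sketch below counts how many token frames a single quantizer layer would emit for one minute of audio. The 8 Hz value is the lowest frame rate reported in the findings on this page; the 75 Hz baseline and the helper name `tokens_per_layer` are assumptions chosen for illustration and do not come from the text above.

```python
def tokens_per_layer(duration_s: float, frame_rate_hz: float) -> int:
    """Token frames one quantizer layer emits for a clip of the given length."""
    return round(duration_s * frame_rate_hz)


ONE_MINUTE_S = 60

# 75 Hz is an assumed baseline frame rate (not stated on this page);
# 8 Hz is the lowest frame rate listed in the findings.
for rate_hz in (75, 8):
    frames = tokens_per_layer(ONE_MINUTE_S, rate_hz)
    print(f"{rate_hz} Hz -> {frames} token frames per layer per minute")
```

Under these assumptions, one minute of speech needs 4500 frames per layer at 75 Hz but only 480 at 8 Hz, which suggests why a lower frame rate eases autoregressive generation of long utterances.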

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling