VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling

Yixuan Zhou; Xiaoyu Qin; Zeyu Jin; Shuoyi Zhou; Shun Lei; Songtao Zhou; Zhiyong Wu; Jia Jia

arXiv:2408.15676 · cs.SD · August 29, 2024

TL;DR

VoxInstruct introduces a unified multilingual codec language model that enables expressive, fine-grained, instruction-guided speech synthesis directly from human instructions, aligning speech generation with other multimodal AIGC tasks.

Contribution

The paper presents a framework that extends text-to-speech to a general human instruction-to-speech task, using speech semantic tokens and classifier-free guidance (CFG) strategies to improve expressiveness and control.

Findings

1. Supports combining a speech prompt with an instruction for expressive synthesis
2. Uses speech semantic tokens for content extraction from instructions
3. Employs CFG strategies to improve instruction adherence
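
The CFG idea named in the findings can be illustrated with a minimal sketch. It shows the generic, single-condition form of classifier-free guidance applied to next-token logits, not the specific strategies used in VoxInstruct, which this summary does not describe. The function names, the toy logits, and the guidance scale are all assumptions made for illustration.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Generic classifier-free guidance over token logits (illustrative only).

    cond:   logits computed with the instruction present
    uncond: logits computed with the instruction dropped
    scale:  guidance scale; 0 returns uncond, 1 returns cond,
            values above 1 push further toward the conditional prediction
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy three-token vocabulary (hypothetical values).
cond = [2.0, 1.0, 0.0]
uncond = [1.0, 1.0, 1.0]
guided = cfg_logits(cond, uncond, 2.0)
probs = softmax(guided)
```

With a scale above 1, the token favoured by the instruction-conditioned pass receives more probability mass than under the conditional logits alone, which is the usual intuition for why guidance can improve adherence to the conditioning signal.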

Abstract

Recent AIGC systems possess the capability to generate digital multimedia content based on human language instructions, such as text, image and video. However, when it comes to speech, existing methods related to human instruction-to-speech generation exhibit two limitations. Firstly, they require the division of inputs into content prompt (transcript) and description prompt (style and speaker), instead of directly supporting human instruction. This division is less natural in form and does not align with other AIGC models. Secondly, the practice of utilizing an independent description prompt to model speech style, without considering the transcript content, restricts the ability to control speech at a fine-grained level. To address these limitations, we propose VoxInstruct, a novel unified multilingual codec language modeling framework that extends traditional text-to-speech tasks into…

Code & Models

Repositories

thuhcsi/voxinstruct (PyTorch, official)