TL;DR
VoxInstruct introduces a unified multilingual codec language model that enables expressive, fine-grained, instruction-guided speech synthesis directly from human instructions, aligning speech generation with other multimodal AIGC tasks.
Contribution
It presents a novel framework that extends text-to-speech to a general instruction-to-speech task, using speech semantic tokens and classifier-free guidance (CFG) strategies for greater expressiveness and control.
Findings
Supports combining a speech prompt with an instruction for expressive synthesis
Uses speech semantic tokens for content extraction from instructions
Employs CFG strategies to improve instruction adherence
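
To make the CFG finding concrete, here is a minimal sketch of generic classifier-free guidance applied to next-token logits. It is not the VoxInstruct implementation: the function names (`cfg_logits`, `softmax`), the toy logit values, and the guidance scale are all assumptions chosen for illustration, and the paper's specific CFG strategies may differ.

```python
import math

def cfg_logits(cond, uncond, scale):
    """Generic classifier-free guidance on logits (illustrative, hypothetical helper).

    cond:   logits from a forward pass that sees the instruction
    uncond: logits from a forward pass with the instruction dropped
    scale:  guidance strength; 1.0 returns cond, 0.0 returns uncond
    """
    if len(cond) != len(uncond):
        raise ValueError("conditional and unconditional logits must match in length")
    # Push the prediction away from the unconditional one, toward the conditional one.
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy values (assumed): the instruction favors token 0.
cond = [2.0, 1.0, 0.0]
uncond = [1.0, 1.0, 1.0]
guided = cfg_logits(cond, uncond, 2.0)

probs_cond = softmax(cond)
probs_guided = softmax(guided)
```

With a scale above 1, the token favored by the instruction gets a larger share of probability mass than under plain conditional decoding, which is the general sense in which CFG can strengthen instruction adherence.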
Abstract
Recent AIGC systems can generate digital multimedia content, such as text, images, and video, from human language instructions. However, when it comes to speech, existing methods for human instruction-to-speech generation exhibit two limitations. Firstly, they require the input to be divided into a content prompt (transcript) and a description prompt (style and speaker), instead of directly supporting human instructions. This division is less natural in form and does not align with other AIGC models. Secondly, using an independent description prompt to model speech style, without considering the transcript content, restricts the ability to control speech at a fine-grained level. To address these limitations, we propose VoxInstruct, a novel unified multilingual codec language modeling framework that extends traditional text-to-speech tasks into…
