Voice Impression Control in Zero-Shot TTS

Kenichi Fujita; Shota Horiguchi; Yusuke Ijima

arXiv:2506.05688·cs.SD·February 19, 2026

TL;DR

This paper introduces a zero-shot TTS method that controls voice impressions through a low-dimensional vector of impression-pair intensities. The vector can be generated by a large language model from a natural language description, and objective and subjective evaluations show that the control is effective.

Contribution

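The paper presents a zero-shot TTS approach that controls voice impressions through a low-dimensional vector representing the intensities of impression pairs (e.g., dark-bright), together with a method that uses a large language model to generate this vector from natural language descriptions.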

Findings

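01 Objective and subjective evaluations demonstrate effective impression control

02 A large language model generates the impression vector from a natural language description

03 Generating the vector this way removes the need for manual optimization when specifying an impression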

Abstract

Para-/non-linguistic information in speech is pivotal in shaping the listeners' impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization. Audio examples are available on our demo page…
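
To make the idea of a low-dimensional impression vector concrete, here is a minimal sketch of how such a vector might be assembled before being passed to a TTS model. Everything beyond the "dark-bright" pair is an assumption for illustration: the function name `make_impression_vector`, the placeholder pair names, the sign convention (negative toward the first word, positive toward the second), and the [-1, 1] range are hypothetical and are not taken from the paper.

```python
from typing import Mapping, Sequence

# Hypothetical pair names: only "dark-bright" comes from the abstract;
# the other two are placeholders, not the paper's actual impression pairs.
IMPRESSION_PAIRS = ("dark-bright", "placeholder-pair-a", "placeholder-pair-b")


def make_impression_vector(
    intensities: Mapping[str, float],
    pairs: Sequence[str] = IMPRESSION_PAIRS,
    limit: float = 1.0,
) -> list[float]:
    """Build a fixed-order vector with one intensity per impression pair.

    Assumed convention (not from the paper): 0.0 is neutral, negative values
    lean toward the first word of a pair, positive values toward the second.
    Pairs that are not specified default to 0.0, and every value is clamped
    to the range [-limit, limit].
    """
    unknown = sorted(set(intensities) - set(pairs))
    if unknown:
        raise KeyError(f"unknown impression pairs: {unknown}")
    return [
        max(-limit, min(limit, float(intensities.get(name, 0.0))))
        for name in pairs
    ]


# Request a fairly bright voice and leave the other pairs neutral.
vector = make_impression_vector({"dark-bright": 0.8})
print(vector)
```

In the natural-language setting described in the abstract, a large language model would produce the per-pair intensities instead of a hand-written dictionary; that step is not sketched here because the abstract gives no details about it.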

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Phonetics and Phonology Research