EmojiVoice: Towards long-term controllable expressivity in robot speech
Paige Tüttösi, Shivam Mehta, Zachary Syvenky, Bermet Burkanova, Gustav Eje Henter, Angelica Lim

TL;DR
EmojiVoice is a customizable TTS toolkit that lets social robots produce expressive speech that varies over long interactions, with fine-grained control. It is demonstrated in three case studies, with improved expressivity reported for storytelling.
Contribution
The paper introduces EmojiVoice, a novel TTS toolkit with emoji-prompting for controllable expressivity in robot speech, suitable for offline deployment and real-time use.
Findings
Emoji prompting enhances long-term speech expressivity in storytelling.
The expressive voice was less preferred in the robot assistant scenario.
Real-time speech generation is feasible with the lightweight Matcha-TTS backbone.
Abstract
Humans vary their expressivity when speaking for extended periods to maintain engagement with their listener. Although social robots tend to be deployed with "expressive" joyful voices, they lack this long-term variation found in human speech. Foundation model text-to-speech systems are beginning to mimic the expressivity in human speech, but they are difficult to deploy offline on robots. We present EmojiVoice, a free, customizable text-to-speech (TTS) toolkit that allows social roboticists to build temporally variable, expressive speech on social robots. We introduce emoji-prompting to allow fine-grained control of expressivity on a phrase level and use the lightweight Matcha-TTS backbone to generate speech in real time. We explore three case studies: (1) a scripted conversation with a robot assistant, (2) a storytelling robot, and (3) an autonomous speech-to-speech interactive…
Taxonomy
Topics: Digital Communication and Language · Social Robot Interaction and HRI · Speech and dialogue systems
