Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech?

Lokesh Kumar; Nirmesh Shah; Ashishkumar P. Gudmalwar; Pankaj Wasnik

arXiv:2603.19831·eess.AS·March 23, 2026

TL;DR

Gesture2Speech introduces a multimodal TTS system that uses hand gesture cues to enhance prosody and synchronize speech with gestures, improving naturalness and expressiveness in synthesized speech.

Contribution

Gesture2Speech is presented as the first system to use hand gesture cues for prosody modulation in neural speech synthesis, built on a novel multimodal Mixture-of-Experts (MoE) architecture and an alignment loss.

Findings

01 Outperforms state-of-the-art baselines in naturalness

02 Achieves better gesture-speech synchrony

03 Demonstrates effective prosody control via gestures

Abstract

Human communication seamlessly integrates speech and bodily motion, where hand gestures naturally complement vocal prosody to express intent, emotion, and emphasis. While recent text-to-speech (TTS) systems have begun incorporating multimodal cues such as facial expressions or lip movements, the role of hand gestures in shaping prosody remains largely underexplored. We propose a novel multimodal TTS framework, Gesture2Speech, that leverages visual gesture cues to modulate prosody in synthesized speech. Motivated by the observation that confident and expressive speakers coordinate gestures with vocal prosody, we introduce a multimodal Mixture-of-Experts (MoE) architecture that dynamically fuses linguistic content and gesture features within a dedicated style extraction module. The fused representation conditions an LLM-based speech decoder, enabling prosodic modulation that is temporally…
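
The fusion step described in the abstract can be illustrated with a toy soft Mixture-of-Experts. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function names (`moe_fuse`, `linear`, `random_matrix`), the feature dimensions, the use of concatenation as the joint input, and the random untrained weights are all hypothetical choices made for illustration. It shows a gate producing one weight per expert from the combined text and gesture features, and a style vector formed as the gate-weighted sum of expert outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, vector):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def moe_fuse(text_feat, gesture_feat, gate_w, expert_ws):
    """Fuse one frame of text and gesture features with a soft MoE.

    Assumption: the joint input is a plain concatenation of both modalities.
    Returns (style_vector, gate_weights).
    """
    joint = text_feat + gesture_feat            # list concatenation
    gate = softmax(linear(gate_w, joint))       # one mixing weight per expert
    expert_outs = [linear(w, joint) for w in expert_ws]
    dim = len(expert_outs[0])
    # Gate-weighted sum of the expert outputs, one value per style dimension.
    style = [sum(g * out[d] for g, out in zip(gate, expert_outs))
             for d in range(dim)]
    return style, gate


# Hypothetical sizes chosen only to keep the example small.
TEXT_DIM, GESTURE_DIM, STYLE_DIM, NUM_EXPERTS = 4, 3, 2, 3
rng = random.Random(0)
gate_w = random_matrix(NUM_EXPERTS, TEXT_DIM + GESTURE_DIM, rng)
expert_ws = [random_matrix(STYLE_DIM, TEXT_DIM + GESTURE_DIM, rng)
             for _ in range(NUM_EXPERTS)]

text_feat = [0.2, -0.1, 0.4, 0.0]     # stand-in for a linguistic embedding
gesture_feat = [0.9, 0.3, -0.5]       # stand-in for a hand-gesture embedding
style, gate = moe_fuse(text_feat, gesture_feat, gate_w, expert_ws)
print("gate weights:", gate)
print("style vector:", style)
```

In a real system the resulting style vector would condition the speech decoder; here it is simply printed, since the decoder and the alignment loss are outside what the abstract specifies.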

Taxonomy

Topics: Emotion and Mood Recognition · Hand Gesture Recognition Systems · Social Robot Interaction and HRI