GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation

Quanwei Yang; Luying Huang; Kaisiyuan Wang; Jiazhi Guan; Shengyi He; Fengguo Li; Hang Zhou; Lingyun Yu; Yingying Li; Haocheng Feng; Hongtao Xie

arXiv:2507.22731 · cs.MM · January 1, 2026

TL;DR

GestureHYDRA is a novel system for generating semantically explicit co-speech hand gestures using a hybrid-modality diffusion transformer and retrieval-augmented generation, improving gesture activation and efficiency.

Contribution

The paper introduces a new dataset and a hybrid-modality diffusion transformer architecture for semantically meaningful gesture synthesis, with a cascaded retrieval-augmented strategy for enhanced gesture activation.

Findings

01. Outperforms existing methods in gesture quality and activation.
02. Achieves higher semantic relevance in generated gestures.
03. Demonstrates efficient and versatile gesture production.

Abstract

While increasing attention has been paid to co-speech gesture synthesis, most previous works neglect to investigate hand gestures with explicit and essential semantics. In this paper, we study co-speech gesture generation with an emphasis on specific hand gesture activation, which can deliver more instructional information than common body movements. To achieve this, we first build a high-quality dataset of 3D human body movements including a set of semantically explicit hand gestures that are commonly used by live streamers. Then we present a hybrid-modality gesture generation system GestureHYDRA built upon a hybrid-modality diffusion transformer architecture with novelly designed motion-style injective transformer layers, which enables advanced gesture modeling ability and versatile gesture operations. To guarantee these specific hand gestures can be activated, we introduce a cascaded…
