GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling

Pinxin Liu; Luchuan Song; Junhua Huang; Haiyang Liu; Chenliang Xu

arXiv:2501.18898 · cs.CV · August 5, 2025


TL;DR

GestureLSM introduces a novel spatial-temporal flow-matching approach with latent shortcut learning for efficient, high-quality full-body gesture generation from speech, outperforming existing methods in speed and coherence.

Contribution

It presents a new flow-matching framework with latent shortcut learning and beta distribution sampling to improve gesture quality and inference speed.
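As a rough illustration of beta distribution sampling, the sketch below draws training timesteps in [0, 1] from a Beta distribution rather than uniformly. It assumes the sampling applies to flow-matching timesteps; the function name `sample_timesteps` and the shape parameters are hypothetical placeholders, not values taken from the paper.

```python
import numpy as np

def sample_timesteps(batch_size, alpha=2.0, beta=1.2, seed=0):
    """Draw one timestep per training example from Beta(alpha, beta).

    alpha and beta are illustrative assumptions; they only control
    where in [0, 1] the sampled timesteps concentrate.
    """
    rng = np.random.default_rng(seed)
    return rng.beta(alpha, beta, size=batch_size)

timesteps = sample_timesteps(batch_size=8)
```

With alpha greater than beta, as in this placeholder setting, the samples concentrate toward t = 1; swapping the two values shifts the mass toward t = 0.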

Findings

01. Achieves state-of-the-art performance on the BEAT2 dataset.

02. Significantly reduces inference time compared to existing methods.

03. Enhances the coherence of full-body gestures through spatial-temporal modeling.

Abstract

Generating full-body human gestures from speech signals remains challenging in terms of both quality and speed. Existing approaches model different body regions, such as the body, legs, and hands, separately; this fails to capture the spatial interactions between them and results in unnatural, disjointed movements. Additionally, their autoregressive or diffusion-based pipelines generate slowly because they require dozens of inference steps. To address these two challenges, we propose GestureLSM, a flow-matching-based approach for co-speech gesture generation with spatial-temporal modeling. Our method i) explicitly models the interaction of tokenized body regions through spatial and temporal attention to generate coherent full-body gestures, and ii) introduces flow matching to enable more efficient sampling by explicitly modeling the latent velocity space. To overcome the suboptimal performance of flow…
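The flow-matching idea mentioned in the abstract can be sketched in a few lines. This is a generic toy example, not the authors' implementation: it assumes a straight-line path between a noise sample and a data sample, and it replaces the trained velocity network with an analytic stand-in (`toy_velocity`) that is exact only when the data is a single point. All function names and the three-dimensional "latent" vector are hypothetical.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Regression target for a velocity network on the straight path."""
    return x1 - x0

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # largest t used is 1 - dt, so 1 - t never reaches zero
        x = x + dt * velocity_fn(x, t)
    return x

def toy_velocity(x1):
    """Stand-in for a trained network: the exact velocity field
    when the data distribution is the single point x1."""
    def v(x, t):
        return (x1 - x) / (1.0 - t)
    return v

rng = np.random.default_rng(0)
x1 = np.array([1.0, -2.0, 0.5])  # hypothetical latent gesture vector
x0 = rng.standard_normal(3)      # noise sample at t = 0
sample = euler_sample(toy_velocity(x1), x0, num_steps=4)
```

In this toy setting the velocity is constant along the path, so Euler integration lands on `x1` whether it takes four steps or one. A learned velocity field is generally not this straight, which is one way to see why reducing the number of sampling steps takes extra care.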

Code & Models

Repositories

andypinxinliu/GestureLSM (PyTorch, official)

Models

🤗 pliu23/GestureLSM (model · ♡ 1)

Taxonomy

Topics: Speech and dialogue systems · Hand Gesture Recognition Systems · Natural Language Processing Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings