TRiMM: Transformer-Based Rich Motion Matching for Real-Time multi-modal Interaction in Digital Humans

Yueqian Guo; Tianzhao Li; Xin Lyu; Jiehaolin Chen; Zhaohan Wang; Sirui Xiao; Yurun Chen; Yezi He; Helin Li; Fan Zhang

arXiv:2506.01077·cs.GR·June 3, 2025

TL;DR

TRiMM introduces a transformer-based framework for real-time, multi-modal 3D gesture generation in digital humans, combining attention mechanisms, sequence modeling, and gesture retrieval to enable responsive, high-quality co-speech gestures.

Contribution

The paper presents a novel multi-modal framework that achieves real-time 3D gesture synthesis with high accuracy and low latency, addressing limitations of previous methods in speed and long-text comprehension.

Findings

01 Achieves 120 fps inference speed on consumer GPUs

02 Maintains 0.15 seconds per-sentence latency

03 Outperforms state-of-the-art gesture generation methods
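
To put the two reported figures in the same units, the short calculation below converts the frame rate into a per-frame time budget and the per-sentence latency into a frame count. It is only a sketch: it assumes that 120 fps refers to generated motion frames per second, and the constant names are illustrative rather than taken from the paper.

```python
# Figures reported in the findings; names are illustrative only.
FPS = 120                   # reported inference speed, frames per second
SENTENCE_LATENCY_S = 0.15   # reported per-sentence latency, seconds

# Time available to produce one frame at the reported rate.
frame_budget_ms = 1000.0 / FPS

# Number of frames that elapse at that rate during one sentence latency.
frames_per_latency = SENTENCE_LATENCY_S * FPS

print(f"{frame_budget_ms:.2f} ms per frame")         # 8.33 ms per frame
print(f"{frames_per_latency:.0f} frames per 0.15 s")  # 18 frames per 0.15 s
```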

Abstract

Large Language Model (LLM)-driven digital humans have sparked a series of recent studies on co-speech gesture generation systems. However, existing approaches struggle with real-time synthesis and long-text comprehension. This paper introduces Transformer-Based Rich Motion Matching (TRiMM), a novel multi-modal framework for real-time 3D gesture generation. Our method incorporates three modules: 1) a cross-modal attention mechanism to achieve precise temporal alignment between speech and gestures; 2) a long-context autoregressive model with a sliding window mechanism for effective sequence modeling; 3) a large-scale gesture matching system that constructs an atomic action library and enables real-time retrieval. Additionally, we develop a lightweight pipeline implemented in the Unreal Engine for experimentation. Our approach achieves real-time inference at 120 fps and maintains a…
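
The abstract's second and third modules (sliding-window context and retrieval from an atomic action library) can be illustrated with a toy example. The sketch below is not the authors' implementation: mean pooling over the window stands in for the learned long-context model, cosine nearest-neighbour search stands in for the gesture matching system, and the library entries, feature vectors, and function names are all hypothetical.

```python
import math
from collections import deque

# Hypothetical atomic action library: gesture name -> feature vector.
LIBRARY = {
    "beat": [1.0, 0.0, 0.0],
    "point": [0.0, 1.0, 0.0],
    "shrug": [0.0, 0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mean_pool(window):
    """Average the vectors in the window (stand-in for a learned context model)."""
    n = len(window)
    return [sum(column) / n for column in zip(*window)]

def match_gestures(features, window_size=2):
    """For each incoming feature vector, retrieve the closest library gesture.

    Only the most recent `window_size` vectors contribute to the query,
    which is the sliding-window part of the sketch.
    """
    window = deque(maxlen=window_size)  # old entries drop off automatically
    matches = []
    for feature in features:
        window.append(feature)
        query = mean_pool(window)
        best = max(LIBRARY, key=lambda name: cosine(query, LIBRARY[name]))
        matches.append(best)
    return matches

# Hypothetical per-step speech/text features.
stream = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.9, 0.1],
    [0.1, 0.0, 0.9],
]
print(match_gestures(stream))
```

Because the window is bounded, the cost per step stays constant regardless of how long the input text is, which is the property a real-time system needs; a production system would presumably replace the linear scan over the library with an indexed search.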

Code & Models

Repositories

teroon/trimm-transformer-based-rich-motion-matching (PyTorch, official)

Taxonomy

Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Human Motion and Animation