M3T: Discrete Multi-Modal Motion Tokens for Sign Language Production
Alexandre Symeonidis-Herzig, Jianhe Low, Ozge Mercanoglu Sincan, Richard Bowden

TL;DR
This paper introduces M3T, a multi-modal motion token system that captures manual and non-manual features for sign language production, achieving state-of-the-art results on multiple benchmarks.
Contribution
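The paper proposes SMPL-FX, a representation that couples FLAME's expression space with the SMPL-X body, and tokenizes it with modality-specific Finite Scalar Quantization VAEs for body, hands, and face. This lets sign language production include non-manual features alongside manual ones.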
Findings
Achieves state-of-the-art quality on three benchmarks.
Reaches 58.3% accuracy on non-manual feature distinction.
Outperforms previous pose-based baselines.
Abstract
Sign language production requires more than hand motion generation. Non-manual features, including mouthings, eyebrow raises, gaze, and head movements, are grammatically obligatory and cannot be recovered from manual articulators alone. Existing 3D production systems face two barriers to integrating them: the standard body model provides a facial space too low-dimensional to encode these articulations, and when richer representations are adopted, standard discrete tokenization suffers from codebook collapse, leaving most of the expression space unreachable. We propose SMPL-FX, which couples FLAME's rich expression space with the SMPL-X body, and tokenize the resulting representation with modality-specific Finite Scalar Quantization VAEs for body, hands, and face. M3T is an autoregressive transformer trained on this multi-modal motion vocabulary, with an auxiliary translation objective…
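To make the tokenization step concrete, here is a minimal sketch of Finite Scalar Quantization applied separately per modality. It is not the authors' implementation: the level counts in `MODALITY_LEVELS`, the three-channel latents, and the helper names are hypothetical. For simplicity the sketch handles only odd level counts and omits the VAE encoder, decoder, and straight-through gradient. The point it illustrates is that FSQ has no learned codebook: every channel is bounded and rounded to a fixed integer grid, so each code index is reachable by construction.

```python
import numpy as np

# Hypothetical per-modality level settings (not taken from the paper).
MODALITY_LEVELS = {
    "body": [5, 5, 5],
    "hands": [5, 5, 5],
    "face": [7, 5, 5],
}

def fsq_quantize(z, levels):
    """Round each latent channel to one of levels[i] integer values.

    Assumes odd level counts, so the grid is symmetric around zero.
    """
    levels = np.asarray(levels)
    assert np.all(levels % 2 == 1), "this sketch handles odd level counts only"
    half = (levels - 1) // 2
    bounded = np.tanh(z) * half            # squash each channel into [-half, half]
    return np.round(bounded).astype(int)   # integers in {-half, ..., half}

def fsq_to_index(q, levels):
    """Map a quantized vector to a single token id (mixed-radix encoding)."""
    levels = np.asarray(levels)
    half = (levels - 1) // 2
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return ((q + half) * basis).sum(axis=-1)

def tokenize(latents):
    """Quantize each modality's latent with its own implicit codebook."""
    return {
        name: fsq_to_index(fsq_quantize(z, MODALITY_LEVELS[name]), MODALITY_LEVELS[name])
        for name, z in latents.items()
    }

# Usage: one frame of (random, stand-in) encoder outputs per modality.
rng = np.random.default_rng(0)
frame = {name: rng.normal(size=len(lv)) for name, lv in MODALITY_LEVELS.items()}
tokens = tokenize(frame)
```

With these hypothetical settings the body and hand vocabularies each hold 5 × 5 × 5 = 125 codes and the face vocabulary holds 7 × 5 × 5 = 175, and giving each modality its own quantizer keeps the three vocabularies independent.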
Taxonomy
Topics: Hand Gesture Recognition Systems · Interactive and Immersive Displays · Face recognition and analysis
