M3T: Discrete Multi-Modal Motion Tokens for Sign Language Production

Alexandre Symeonidis-Herzig; Jianhe Low; Ozge Mercanoglu Sincan; Richard Bowden

arXiv:2603.23617 · cs.CV · March 26, 2026

TL;DR

This paper introduces M3T, a multi-modal motion token system that captures manual and non-manual features for sign language production, achieving state-of-the-art results on multiple benchmarks.

Contribution

The paper proposes SMPL-FX and modality-specific quantization for a comprehensive sign language motion representation, enabling improved sign language production that includes non-manual features.

Findings

01. Achieves state-of-the-art quality on three benchmarks.

02. Reaches 58.3% accuracy on non-manual feature distinction.

03. Outperforms previous pose-based baselines.

Abstract

Sign language production requires more than hand motion generation. Non-manual features, including mouthings, eyebrow raises, gaze, and head movements, are grammatically obligatory and cannot be recovered from manual articulators alone. Existing 3D production systems face two barriers to integrating them: the standard body model provides a facial space too low-dimensional to encode these articulations, and when richer representations are adopted, standard discrete tokenization suffers from codebook collapse, leaving most of the expression space unreachable. We propose SMPL-FX, which couples FLAME's rich expression space with the SMPL-X body, and tokenize the resulting representation with modality-specific Finite Scalar Quantization VAEs for body, hands, and face. M3T is an autoregressive transformer trained on this multi-modal motion vocabulary, with an auxiliary translation objective…
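
To make the tokenization step concrete, here is a minimal sketch of the rounding idea behind Finite Scalar Quantization, written in plain Python. It is a simplified illustration and not the authors' implementation: the function names, the tanh-based bounding into [0, L-1], and the per-modality level counts in `MODALITY_LEVELS` are all assumptions made for this example. Each latent dimension is squashed to a bounded range and rounded to one of a fixed number of levels, and the resulting digits are packed into a single token id, so every entry of the implicit codebook is reachable by construction.

```python
import math

def fsq_quantize(z, levels):
    """Round each latent dimension to one of levels[i] integer bins.

    Simplified variant: tanh bounds the value, then it is rescaled to
    [0, levels[i] - 1] and rounded. Training would additionally need a
    straight-through gradient for the rounding, which is omitted here.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    digits = []
    for value, n_levels in zip(z, levels):
        bounded = (math.tanh(value) + 1.0) / 2.0  # in [0, 1]
        digits.append(int(round(bounded * (n_levels - 1))))
    return digits

def digits_to_token(digits, levels):
    """Pack per-dimension digits into one token id (mixed-radix encoding)."""
    token = 0
    for digit, n_levels in zip(digits, levels):
        token = token * n_levels + digit
    return token

def token_to_digits(token, levels):
    """Invert digits_to_token."""
    digits = []
    for n_levels in reversed(levels):
        digits.append(token % n_levels)
        token //= n_levels
    return digits[::-1]

def codebook_size(levels):
    """Size of the implicit codebook: the product of the level counts."""
    return math.prod(levels)

# Hypothetical per-modality level counts, chosen only for illustration;
# the paper's actual configuration is not given in this summary.
MODALITY_LEVELS = {
    "body": [5, 5, 5],
    "hands": [5, 5, 5],
    "face": [7, 5, 5, 5],
}

def tokenize_frame(latents):
    """Quantize one frame's latents separately for each modality."""
    return {
        name: digits_to_token(fsq_quantize(z, MODALITY_LEVELS[name]),
                              MODALITY_LEVELS[name])
        for name, z in latents.items()
    }
```

Keeping a separate quantizer per modality means each of body, hands, and face gets its own token vocabulary, which in this sketch has `codebook_size(MODALITY_LEVELS[name])` entries.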

Taxonomy

Topics: Hand Gesture Recognition Systems · Interactive and Immersive Displays · Face Recognition and Analysis