LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

Zekun Li; Sizhe An; Chengcheng Tang; Chuan Guo; Ivan Shugurov; Linguang Zhang; Amy Zhao; Srinath Sridhar; Lingling Tao; Abhay Mittal

arXiv:2602.12370 · cs.CV · April 20, 2026

TL;DR

LLaMo is a unified multimodal model that extends pretrained language models to both understand motion and generate it from text. It combines a continuous motion latent space with a Mixture-of-Transformers architecture to support real-time, high-fidelity motion generation.

Contribution

LLaMo introduces a continuous latent space for motion and a modality-specific Mixture-of-Transformers (MoT) design, unifying motion-language understanding and generation within a pretrained LLM.
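
The modality-specific routing idea behind a Mixture-of-Transformers design can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `make_linear_expert` and `route_by_modality` are hypothetical, each "expert" is a toy per-token transform standing in for a full set of transformer weights, attention across tokens is omitted, and treating the text branch as frozen is an assumption made for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def make_linear_expert(scale: float, bias: float) -> Expert:
    """Build a toy per-token transform (hypothetical stand-in for one
    modality's transformer weights)."""
    def expert(x: Sequence[float]) -> Vector:
        return [scale * v + bias for v in x]
    return expert


def route_by_modality(
    tokens: Sequence[Tuple[str, Vector]],
    experts: Dict[str, Expert],
) -> List[Vector]:
    """Send each token through the parameters that belong to its modality."""
    return [experts[modality](vec) for modality, vec in tokens]


# Assumption: the text branch reuses pretrained weights and is never updated.
text_expert = make_linear_expert(scale=2.0, bias=0.0)

# Two different motion branches, e.g. before and after motion training.
motion_before = make_linear_expert(scale=0.5, bias=1.0)
motion_after = make_linear_expert(scale=-3.0, bias=4.0)

text_only = [("text", [1.0, 2.0]), ("text", [3.0, 4.0])]
out_before = route_by_modality(
    text_only, {"text": text_expert, "motion": motion_before}
)
out_after = route_by_modality(
    text_only, {"text": text_expert, "motion": motion_after}
)

# Text-only inputs never touch the motion parameters, so changing the
# motion branch leaves the text outputs identical.
print(out_before == out_after)

mixed = [("text", [1.0]), ("motion", [2.0])]
print(route_by_modality(mixed, {"text": text_expert, "motion": motion_before}))
```

In this toy setup, text tokens are processed only by the text branch, so any change to the motion branch cannot alter text-only behavior. That is the intuition for why separate per-modality parameters can help avoid forgetting; a real model would also include attention over the combined sequence.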

Findings

01. Achieves high-fidelity text-to-motion generation

02. Enables real-time streaming motion generation (>30 FPS)

03. Excels in zero-shot motion generation and captioning

Abstract

Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, which can result in catastrophic forgetting of linguistic capabilities due to the limited scale of available text-motion pairs. Furthermore, prior methods typically convert motion into discrete representations via quantization to integrate with language models, introducing substantial jitter artifacts from discrete tokenization. To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the…
