LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference
Junkun Jiang, Ho Yin Au, Jingyu Xiang, Jie Chen

TL;DR
LaMoGen leverages a symbolic motion representation called LabanLite and large language models to generate interpretable, controllable human motions from language, addressing limitations of previous embedding-based methods.
Contribution
The paper introduces LabanLite for symbolic motion encoding and LaMoGen, a framework that uses LLMs for symbolic reasoning to synthesize linguistically grounded motions.
Findings
Outperforms prior methods on the paper's proposed benchmark and on public datasets.
Provides interpretable and controllable motion generation.
Establishes a new baseline for text-to-motion synthesis.
Abstract
Human motion is highly expressive and naturally aligned with language, yet prevailing methods relying heavily on joint text-motion embeddings struggle to synthesize temporally accurate, detailed motions and often lack explainability. To address these limitations, we introduce LabanLite, a motion representation developed by adapting and extending the Labanotation system. Unlike black-box text-motion embeddings, LabanLite encodes each atomic body-part action (e.g., a single left-foot step) as a discrete Laban symbol paired with a textual template. This abstraction decomposes complex motions into interpretable symbol sequences and body-part instructions, establishing a symbolic link between high-level language and low-level motion trajectories. Building on LabanLite, we present LaMoGen, a Text-to-LabanLite-to-Motion Generation framework that enables large language models (LLMs) to compose…
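The idea of pairing each discrete symbol with a textual template can be illustrated with a small sketch. This is not the paper's schema: the class name `LabanSymbol`, its fields, the beat-based timing, and the template wording are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabanSymbol:
    """One atomic body-part action (hypothetical fields, not the paper's schema)."""
    body_part: str   # e.g. "left_foot"
    direction: str   # e.g. "forward"
    level: str       # e.g. "low", "middle", "high"
    start_beat: int  # when the action begins, in beats (assumed time unit)
    duration: int    # how long it lasts, in beats

    def to_text(self) -> str:
        """Render the symbol through a fixed textual template (assumed wording)."""
        part = self.body_part.replace("_", " ")
        return (f"{part} moves {self.direction} at {self.level} level "
                f"from beat {self.start_beat} for {self.duration} beat(s)")


def describe(sequence):
    """Turn a symbol sequence into time-ordered body-part instructions."""
    ordered = sorted(sequence, key=lambda s: (s.start_beat, s.body_part))
    return [s.to_text() for s in ordered]


# Two steps, deliberately listed out of order to show the time sorting.
walk = [
    LabanSymbol("right_foot", "forward", "middle", 1, 1),
    LabanSymbol("left_foot", "forward", "middle", 0, 1),
]
for line in describe(walk):
    print(line)
```

Because every symbol is discrete and carries its own timing, properties such as the number of steps can be read directly off the sequence, which is the kind of interpretability the abstract contrasts with joint text-motion embeddings.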
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. LabanLite’s design (rooted in Labanotation) resolves the "black-box" issue of joint text-motion embeddings. Its separation of conceptual (direction/level/hold) and detail (orientation/bend/effort) symbols enables explicit, human-readable alignment between text instructions and motion trajectories.
2. LaMoGen pioneers LLM-driven autonomous symbolic composition for motion generation, rather than using LLMs as passive decomposers. Retrieval-augmented prompting allows LLMs to reason about tempora
1. LabanLite’s high-level abstraction fails to capture individual differences (e.g., movement speed variations) and fine-grained semantics (e.g., finger/toe gestures), leading to higher FID scores compared to baselines that model low-level motion variations.
2. Performance heavily relies on LLM strength (GPT-4.1 outperforms smaller models) and is constrained by context windows: adding more than 3 retrieval examples provides no benefit and may degrade performance.
3. Labanotation requires a learni
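The retrieval cap mentioned in this review can be pictured with a minimal sketch of few-shot prompt assembly. Everything here is an assumption for illustration: the word-overlap scorer stands in for whatever retriever the paper uses, and the function names, prompt wording, and symbol strings are hypothetical.

```python
def overlap_score(query: str, text: str) -> int:
    """Count shared lowercase words; a toy stand-in for a real retriever."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def build_prompt(instruction: str, library, max_examples: int = 3) -> str:
    """Assemble a few-shot prompt from the best-matching (text, symbols) pairs."""
    ranked = sorted(library, key=lambda ex: overlap_score(instruction, ex[0]),
                    reverse=True)
    parts = ["Compose a symbol sequence for the instruction."]
    # Cap the number of retrieved examples to keep the context short.
    for text, symbols in ranked[:max_examples]:
        parts.append(f"Example: {text}\nSymbols: {symbols}")
    parts.append(f"Instruction: {instruction}\nSymbols:")
    return "\n\n".join(parts)


# Hypothetical library entries; the symbol strings are placeholders.
library = [
    ("walk forward two steps", "L-foot fwd | R-foot fwd"),
    ("raise the left arm", "L-arm high"),
    ("walk backward one step", "L-foot back"),
    ("jump in place", "both-feet place high"),
    ("turn around slowly", "torso rotate"),
]
prompt = build_prompt("walk forward three steps", library)
print(prompt)
```

The default of `max_examples=3` mirrors the reviewer's observation that more than three retrieved examples did not help; it is a parameter here so the cap can be varied.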
1. It achieves precise control over motion details (e.g., step count and timing) through a human-readable symbolic representation.
2. It uniquely uses an LLM for high-level symbolic reasoning to plan motions, separating complex logic from low-level synthesis.
3. It introduces a benchmark with a set of metrics (SMT, TMP, HMN) to assess temporal, semantic, and coordination alignment.
1. The paper's quantitative evaluation on the widely used HumanML3D benchmark does not include comparisons with several recent and highly influential state-of-the-art models, such as MoMask and MotionGPT. Furthermore, while its performance is competitive, its FID and R-Precision scores do not consistently surpass the older baselines it was compared against.
2. The paper establishes a LabanLite codebook of size 158 but does not provide a clear rationale or empirical validation for this specific
* **Interpretable intermediate representation:** The Laban-style codebook offers human-readable motion factors (who/what moves, where/when), aiding analysis and potential editing/conditioning.
* **Decomposed generation:** Separating **conceptual planning** from **kinematic detailing** is a clear design that addresses known issues in long-horizon consistency and controllability.
* **New evaluation perspective:** The proposed Laban metrics target temporal structure and multi-part coherence, comple
* There is no quantitative or qualitative head-to-head against recent token-based/VQ approaches (TM2T, MotionGPT, MoMask, Motion-Agent). Those methods typically rely on VQ-VAE codebooks; without direct comparison, it is difficult to judge **representation power** and the efficacy–realism trade-off of the proposed Laban codebook.
* All three Laban metrics depend on the paper’s **rule-based symbol detector** and fixed thresholds. This couples the evaluation to the authors’ discretization choices.
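The concern about fixed thresholds can be made concrete with a toy rule-based detector. This is a sketch under stated assumptions, not the paper's detector: the function name, the axis convention, the metre units, the direction labels, and the default threshold value are all hypothetical.

```python
def detect_direction(displacement, threshold=0.10):
    """Map a horizontal displacement to a discrete direction label.

    displacement: (dx, dz) in metres, with x = right and z = forward (assumed).
    threshold: minimum displacement that counts as movement (assumed value).
    """
    dx, dz = displacement
    if max(abs(dx), abs(dz)) < threshold:
        return "place"  # too small to count as a directed move
    if abs(dz) >= abs(dx):
        return "forward" if dz > 0 else "backward"
    return "right" if dx > 0 else "left"


# The same small step gets a different symbol under a different threshold.
small_step = (0.02, 0.08)
print(detect_direction(small_step, threshold=0.10))
print(detect_direction(small_step, threshold=0.05))
```

With the looser threshold the step is labelled as staying in place, and with the tighter one it becomes a forward move. Any metric computed over such labels inherits this sensitivity, which is the coupling the reviewer describes.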
Taxonomy
Topics: Human Motion and Animation · Multimodal Machine Learning Applications · Human Pose and Action Recognition
