CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild
Balamurugan Thambiraja, Omid Taheri, Radek Danecek, Giorgio Becherini, Gerard Pons-Moll, Justus Thies

TL;DR
This paper introduces CLUTCH, an LLM-based system for generating and captioning natural hand motions in in-the-wild settings, supported by a new 32K-sequence dataset (3D-HIW), a VQ-VAE motion tokenizer (SHIFT), and a geometric refinement stage.
Contribution
The paper presents a new dataset, 3D-HIW, and a novel LLM-based system, CLUTCH, which uses the SHIFT tokenizer and a geometric refinement stage for in-the-wild hand motion modeling.
Findings
State-of-the-art results on text-to-motion tasks
First benchmark for in-the-wild hand motion modeling
Improved generalization and reconstruction fidelity
Abstract
Hands play a central role in daily life, yet modeling natural hand motions remains underexplored. Existing methods that tackle text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to "in-the-wild" settings. Further, contemporary models and their training schemes struggle to capture animation fidelity with text-motion alignment. To address this, we (1) introduce '3D Hands in the Wild' (3D-HIW), a dataset of 32K 3D hand-motion sequences and aligned text, and (2) propose CLUTCH, an LLM-based hand animation system with two critical innovations: (a) SHIFT, a novel VQ-VAE architecture to tokenize hand motion, and (b) a geometric refinement stage to finetune the LLM. To build 3D-HIW, we propose a data annotation pipeline that combines vision-language models (VLMs) and state-of-the-art 3D hand…
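The abstract describes SHIFT as a VQ-VAE that tokenizes hand motion. As a rough illustration of what "tokenizing" means in that family of models, the sketch below shows only the generic vector-quantization lookup: each continuous feature vector is replaced by the index of its nearest codebook entry. This is not the SHIFT architecture; the function names, the 2-D toy features, and the codebook values are all hypothetical, and in a real VQ-VAE the inputs would be learned encoder latents rather than hand-picked numbers.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for frame in frames:
        distances = [squared_distance(frame, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each discrete token."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy codebook with three 2-D entries.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Hypothetical per-frame features (stand-ins for encoder outputs).
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

tokens = quantize(frames, codebook)
reconstruction = dequantize(tokens, codebook)
```

The resulting integer sequence is what a language model can consume alongside text tokens; decoding maps the integers back to codebook vectors, which a decoder network would then turn into motion.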
Peer Reviews
Decision: ICLR 2026 Poster
1. The 3D-HIW dataset is a good contribution. The proposed VLM-based annotation pipeline is a clever and scalable approach to captioning this in-the-wild data. 2. The paper does a good job of validating its design choices with the ablations for the SHIFT tokenizer and the training stages. 3. The paper is well-written, the figures are informative, and the core ideas are articulated clearly.
1. Lack of Hand-Object Interaction (HOI) is the most significant limitation. The paper frames its work as "in-the-wild" yet the model only generates 3D hand motion. It does not model the objects being interacted with. True in-the-wild motion is almost entirely defined by HOI, which is explicitly left as future work. 2. The model is trained and evaluated exclusively on the authors' new dataset. It is unclear how CLUTCH would perform on other public benchmarks.
1. The motion dataset represents a highly valuable contribution to the field. 2. The SHIFT mechanism is well-motivated, effectively decoupling trajectory-level movements from fine-grained finger motions, and ablation studies demonstrate its effectiveness. Also, enabling bidirectional motion–text decoding is an innovative design, and it is noteworthy that this approach performs successfully in practice.
1. Since the hand motions are synthesized from a single textual description, how is motion diversity ensured? Are there mechanisms in CLUTCH to generate varied hand trajectories or poses from the same caption? 2. While the proposed approach is effective for isolated hand motions and the datasets are centered on hand-only movements, its omission of object interactions could constrain its applicability to more realistic, object-involved settings.
The scale of the hand motion dataset is unprecedented. The work provides a thorough analysis of the introduced dataset. The proposed method outperforms multiple baselines on the introduced dataset. The design choices of the method and filtering of the dataset are quantitatively supported by ablation studies. The proposed method supports both the text-to-motion and motion-to-text tasks simultaneously.
For a work introducing a novel dataset as its main contribution, more qualitative examples of the generated trajectories as well as hand poses and coarse/fine-grained textual descriptions are necessary. This is a major weakness, especially coupled with the following concern: The noun distribution in Figure 11 shows several undesirable entries being common in the dataset, e.g. "hand" (hand touching a hand?) and "cut" (a verb?). This raises questions about the quality of the dataset's noun/verb an
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Hand Gesture Recognition Systems
