MoCHA: Denoising Caption Supervision for Motion-Text Retrieval
Nikolai Warner, Cameron Ethan Taylor, Irfan Essa, Apaar Sadhwani

TL;DR
MoCHA introduces a canonicalization framework for text captions in motion-text retrieval. By standardizing captions to their motion-recoverable content, it reduces within-motion embedding variance and improves cross-dataset transfer.
Contribution
MoCHA proposes a text canonicalization method, with learned variants based on LLMs and distilled models, to improve motion-text retrieval performance.
Findings
MoCHA achieves state-of-the-art results on HumanML3D and KIT-ML datasets.
Canonicalization reduces within-motion embedding variance by 11-19%.
Cross-dataset transfer improves significantly, with up to a 94% gain when transferring from HumanML3D (H) to KIT-ML (K).
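To make the variance finding concrete, here is a minimal sketch of one way to measure within-motion embedding variance: the mean squared distance of each caption embedding to the centroid of all captions for the same motion. This is an assumed definition for illustration; the summary does not state the exact metric used in the paper, and the function names `within_motion_variance` and `relative_reduction` are hypothetical.

```python
import numpy as np

def within_motion_variance(embeddings, motion_ids):
    """Mean squared distance of caption embeddings to their motion's centroid.

    Assumed definition (not taken from the paper): embeddings are
    L2-normalized, grouped by motion id, and the per-motion spread is
    averaged over motions.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    motion_ids = np.asarray(motion_ids)

    # Normalize so the measure reflects direction, not embedding magnitude.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    per_motion = []
    for mid in np.unique(motion_ids):
        group = unit[motion_ids == mid]
        centroid = group.mean(axis=0)
        # Average squared Euclidean distance to the centroid for this motion.
        per_motion.append(np.mean(np.sum((group - centroid) ** 2, axis=1)))
    return float(np.mean(per_motion))

def relative_reduction(before, after):
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before
```

Under this definition, a motion whose captions all map to the same embedding contributes zero, so canonicalization that collapses paraphrases toward a common form lowers the score.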
Abstract
Text-motion retrieval systems learn shared embedding spaces from motion-caption pairs via contrastive objectives. However, each caption is not a deterministic label but a sample from a distribution of valid descriptions: different annotators produce different text for the same motion, mixing motion-recoverable semantics (action type, body parts, directionality) with annotator-specific style and inferred context that cannot be determined from 3D joint coordinates alone. Standard contrastive training treats each caption as the single positive target, overlooking this distributional structure and inducing within-motion embedding variance that weakens alignment. We propose MoCHA, a text canonicalization framework that reduces this variance by projecting each caption onto its motion-recoverable content prior to encoding, producing tighter positive clusters and better-separated embeddings.…
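The contrastive objective mentioned in the abstract can be illustrated with a symmetric InfoNCE loss, a common choice for learning shared embedding spaces. The sketch below is a generic NumPy version under stated assumptions: the summary does not confirm that MoCHA uses this exact loss, and the function name `symmetric_info_nce` and the temperature value of 0.07 are illustrative choices, not values from the paper.

```python
import numpy as np

def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(motion_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired motion and caption embeddings.

    Row i of `motion_emb` is assumed to be paired with row i of `text_emb`;
    every other row in the batch is treated as a negative.
    """
    m = np.asarray(motion_emb, dtype=float)
    t = np.asarray(text_emb, dtype=float)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)

    # Cosine similarities scaled by temperature; shape (batch, batch).
    logits = m @ t.T / temperature
    diag = np.arange(len(m))

    # Motion-to-text: each row is a distribution over captions.
    loss_m2t = -_log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-motion: each column is a distribution over motions.
    loss_t2m = -_log_softmax(logits, axis=0)[diag, diag].mean()
    return float(0.5 * (loss_m2t + loss_t2m))
```

In this formulation each motion has exactly one positive caption per batch, which is the single-positive treatment the abstract describes. Canonicalizing captions before they reach the text encoder changes the inputs to `text_emb`, not the loss itself.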
