Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models
Kent Fujiwara, Mikihiro Tanaka, Qing Yu

TL;DR
This paper highlights the importance of temporal accuracy in motion-language models, introduces the CAR evaluation to identify chronological misalignments, and proposes training with shuffled event sequences to improve temporal understanding.
Contribution
The paper introduces the Chronologically Accurate Retrieval (CAR) task and a training method based on shuffled event sequences, and argues that temporal alignment in motion-language models has previously been overlooked.
Findings
CAR reveals that many models fail to understand the chronological order of events.
Training with shuffled event sequences improves temporal alignment.
Enhanced models show better performance in text-motion retrieval and generation.
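As a rough illustration of the second finding, shuffled descriptions can be treated as extra negatives in a contrastive objective. The sketch below is not taken from the paper: the InfoNCE-style form, the function name, and the temperature value are all assumptions, chosen only to show how a shuffled negative that scores close to the positive raises the loss.

```python
import math
from typing import Sequence


def contrastive_loss_with_shuffled_negatives(
    pos_sim: float,
    in_batch_neg_sims: Sequence[float],
    shuffled_neg_sims: Sequence[float],
    temperature: float = 0.07,  # assumed value, not from the paper
) -> float:
    """InfoNCE-style loss for one motion (hypothetical formulation).

    pos_sim: similarity between the motion and its correct description.
    in_batch_neg_sims: similarities to descriptions of other motions.
    shuffled_neg_sims: similarities to the same description with its
        events reordered (the chronologically wrong negatives).
    """
    negatives = list(in_batch_neg_sims) + list(shuffled_neg_sims)
    logits = [pos_sim / temperature] + [s / temperature for s in negatives]
    # Numerically stable log-sum-exp over positive and negative logits.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    # Negative log-probability assigned to the positive description.
    return log_denominator - logits[0]


without_shuffled = contrastive_loss_with_shuffled_negatives(0.9, [0.1], [])
with_shuffled = contrastive_loss_with_shuffled_negatives(0.9, [0.1], [0.85])
print(with_shuffled > without_shuffled)
```

Under this assumed objective, a model that embeds the shuffled text almost as close to the motion as the original is penalized, which pushes the text encoder to become sensitive to event order.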
Abstract
With the release of large-scale motion datasets with textual annotations, the task of establishing a robust latent space for language and 3D human motion has recently witnessed a surge of interest. Methods have been proposed to convert human motion and texts into features to achieve accurate correspondence between them. Despite these efforts to align language and motion representations, we claim that the temporal element is often overlooked, especially for compound actions, resulting in chronological inaccuracies. To shed light on the temporal alignment in motion-language latent spaces, we propose Chronologically Accurate Retrieval (CAR) to evaluate the chronological understanding of the models. We decompose textual descriptions into events, and prepare negative text samples by shuffling the order of events in compound action descriptions. We then design a simple task for…
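The negative-sample construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the events have already been extracted as a list of strings, it rejoins them with a hypothetical ", then " connective, and it assumes the evaluation counts a motion as correct when the original description scores strictly higher than every shuffled variant. The `similarity` callable stands in for whatever motion-language model is being tested.

```python
from itertools import permutations
from typing import Callable, List, Sequence, Tuple


def join_events(events: Sequence[str]) -> str:
    # Hypothetical connective; the paper's phrasing may differ.
    return ", then ".join(events)


def shuffled_negatives(events: Sequence[str]) -> List[str]:
    """Return every reordering of the events except the original order."""
    original = tuple(events)
    seen = set()
    negatives = []
    for perm in permutations(events):
        if perm != original and perm not in seen:
            seen.add(perm)
            negatives.append(join_events(perm))
    return negatives


def car_accuracy(
    similarity: Callable[[object, str], float],
    samples: Sequence[Tuple[object, Sequence[str]]],
) -> float:
    """Fraction of compound-action samples where the correct order wins."""
    correct = 0
    total = 0
    for motion, events in samples:
        negatives = shuffled_negatives(events)
        if not negatives:
            continue  # single-event descriptions have no shuffled negative
        total += 1
        positive_score = similarity(motion, join_events(events))
        if all(positive_score > similarity(motion, neg) for neg in negatives):
            correct += 1
    return correct / total if total else 0.0


# Toy check: the "motion" here is just its ground-truth event tuple.
samples = [(("walk forward", "sit down"), ["walk forward", "sit down"])]
order_aware = lambda motion, text: 1.0 if text == join_events(motion) else 0.0
order_blind = lambda motion, text: 1.0  # ignores event order entirely
print(car_accuracy(order_aware, samples), car_accuracy(order_blind, samples))
```

A scorer that ignores event order ties the original with its shuffled variants and so gets no credit under the strict comparison, which is the kind of failure this evaluation is meant to surface.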
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Human Motion and Animation
Methods: ALIGN
