TL;DR
WaMo introduces a wavelet-based multi-frequency framework for fine-grained text-motion retrieval, capturing detailed body part dynamics and improving semantic alignment between text descriptions and 3D motion sequences.
Contribution
It proposes a novel wavelet-based feature extraction method that enhances part-specific and temporal detail capture for better text-motion matching.
Findings
Achieves 17.0% and 18.2% improvements in Rsum on the HumanML3D and KIT-ML datasets, respectively.
Outperforms existing state-of-the-art methods in text-motion retrieval.
Effectively captures multi-resolution motion details for precise semantic alignment.
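For readers unfamiliar with the Rsum metric cited above, here is a minimal sketch of how such a score is commonly computed in retrieval work: recall@K is summed over several cutoffs and over both retrieval directions. The cutoffs `(1, 2, 3, 5, 10)`, the function names, and the toy similarity matrix are assumptions for illustration; the paper's exact evaluation protocol is not given in this summary.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose true match (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of candidates scoring strictly higher than the true match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def rsum(sim, ks=(1, 2, 3, 5, 10)):
    """Sum recall@k over text-to-motion (rows) and motion-to-text (columns).

    The default cutoffs are an assumption, not taken from the paper.
    """
    sim_t = [list(col) for col in zip(*sim)]  # transpose for the reverse direction
    return sum(recall_at_k(sim, k) + recall_at_k(sim_t, k) for k in ks)


# Hypothetical toy data: sim[i][j] is the score of text i against motion j,
# and text i is assumed to be paired with motion i.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.4, 0.8],
    [0.2, 0.1, 0.7],
]
print(rsum(sim, ks=(1, 2)))
```

Because Rsum adds up many recall values, a relative improvement in Rsum reflects gains aggregated across all cutoffs and both directions rather than at any single rank.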
Abstract
Text-Motion Retrieval (TMR) aims to retrieve 3D motion sequences semantically relevant to text descriptions. However, matching 3D motions with text remains highly challenging, primarily due to the intricate structure of the human body and its spatial-temporal dynamics. Existing approaches often overlook these complexities, relying on general encoding methods that fail to distinguish different body parts and their dynamics, limiting precise semantic alignment. To address this, we propose WaMo, a novel wavelet-based multi-frequency feature extraction framework. It fully captures part-specific and time-varying motion details across multiple resolutions on body joints, extracting discriminative motion features to achieve fine-grained alignment with texts. WaMo has three key components: (1) Trajectory Wavelet Decomposition decomposes motion signals into frequency components that preserve both…
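To make the idea of decomposing a joint trajectory into frequency components concrete, the sketch below applies a multi-level Haar wavelet transform to a single 1-D coordinate track. This is a generic illustration only: the Haar wavelet, the number of levels, and the names `haar_step` and `haar_decompose` are assumptions, and the sketch is not the authors' Trajectory Wavelet Decomposition module.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal), 2)
    approx = [(signal[i] + signal[i + 1]) * s for i in pairs]  # low-frequency part
    detail = [(signal[i] - signal[i + 1]) * s for i in pairs]  # high-frequency part
    return approx, detail


def haar_decompose(signal, levels):
    """Return [approx_L, detail_L, ..., detail_1], ordered coarse to fine."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return [approx] + details[::-1]


# Hypothetical trajectory: one coordinate of one joint over 8 frames.
trajectory = [0.0, 0.0, 1.0, 1.0, 2.0, 4.0, 2.0, 0.0]
bands = haar_decompose(trajectory, levels=2)
for name, band in zip(["approx_2", "detail_2", "detail_1"], bands):
    print(name, [round(v, 3) for v in band])
```

The coarse approximation band follows slow, large-scale movement, while the detail bands isolate progressively faster changes. Running the same transform per joint or per body part would give part-specific, multi-resolution features of the kind the abstract describes.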
Peer Reviews
Decision·Submitted to ICLR 2026
- impressive results
- trustable experimental settings
- comprehensive ablations
- qualitative evidence
- theoretical grounding
- the proposed approach is very specific to the given task (3D motion retrieval), which could limit its interest for the broader ICLR community
- the contribution is architectural rather than conceptual
• Introduces wavelet-based multi-frequency analysis for fine-grained motion modeling.
• Focuses on part-specific and temporal dynamics, improving text-motion alignment.
• Achieves clear empirical gains on standard benchmarks.
• Provides a coherent, modular pipeline (TWD, TWR, DMSP) from decomposition to retrieval embedding.
1. **Conceptual novelty and task-specific motivation are unclear.** WaMo combines multi-frequency decomposition and disordered motion sequence prediction (DMSP), but each component has clear precedents. Frequency-based representations have been used in motion-related tasks (e.g., WaveletMotion, WaveAR, HumanMAC), local/part-level modeling is already incorporated in methods like MoPa (MotionPatch, CVPR 2024), and shuffling/reordering sequences resembles common self-supervised tasks in video and imag
• Uses multi-frequency decomposition to better represent different motion dynamics across time scales.
• The DMSP task encourages the model to understand temporal structure, which may help fine-grained text-motion alignment.
• Empirical gains on standard datasets demonstrate practical effectiveness.
1. **Motivation needs further clarification.** The motivation presented in Figure 1 is not very representative. From the figure, the three frequency scales show rather flat trajectories between frames 50–100, followed by sharp changes around 150. However, it remains unclear what these low- and high-frequency components actually correspond to in terms of specific human motions. The paper would benefit from deeper analysis and interpretation of the relationship between motion dynamics and their fr
