TriC-Motion: Tri-Domain Causal Modeling Grounded Text-to-Motion Generation
Yiyang Cao, Yunze Deng, Ziyu Lin, Bin Feng, Xinggang Wang, Wenyu Liu, Dandan Zheng, Jingdong Chen

TL;DR
TriC-Motion introduces a unified causal diffusion framework that models spatial, temporal, and frequency domains simultaneously for improved text-to-motion generation quality.
Contribution
It proposes a novel tri-domain causal modeling framework with domain-specific modules and a fusion mechanism, enhancing motion generation fidelity and coherence.
Findings
Achieves an R@1 (top-1 R-Precision) of 0.612 on the HumanML3D dataset
Outperforms state-of-the-art methods in motion quality and alignment
Effectively disentangles motion-irrelevant noise
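For readers unfamiliar with the R@1 figure above, the following is a minimal sketch of how a top-k retrieval precision of this kind can be computed from paired text and motion embeddings. It is not the paper's evaluation code. The function name `r_precision` and the toy embeddings are hypothetical. The sketch assumes Euclidean distance and that row i of each array forms a matching pair. The HumanML3D protocol, as commonly described, ranks within pools of 32 candidates using a pretrained evaluator, which this sketch does not include.

```python
import numpy as np


def r_precision(text_emb, motion_emb, top_k=3):
    """Return [R@1, ..., R@top_k] for paired embeddings (row i matches row i).

    Hypothetical helper for illustration; assumes Euclidean distance.
    """
    # Pairwise distances: dist[i, j] = ||text_i - motion_j||
    diff = text_emb[:, None, :] - motion_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Distance from each text to its own matching motion
    true_dist = np.diag(dist)[:, None]
    # Rank of the match = number of candidates strictly closer than it
    rank = (dist < true_dist).sum(axis=1)
    # R@k is the fraction of queries whose match ranks within the top k
    return [float((rank < k).mean()) for k in range(1, top_k + 1)]


# Toy example: perfectly aligned embeddings retrieve every match first.
aligned = np.eye(4)
print(r_precision(aligned, aligned))  # [1.0, 1.0, 1.0]
```

Because the score depends entirely on the embedding space used for ranking, it measures alignment as judged by that evaluator rather than motion quality directly.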
Abstract
Text-to-motion generation, a rapidly evolving field in computer vision, aims to produce realistic and text-aligned motion sequences. Current methods primarily focus on spatial-temporal modeling or independent frequency domain analysis, lacking a unified framework for joint optimization across spatial, temporal, and frequency domains. This limitation hinders the model's ability to leverage information from all domains simultaneously, leading to suboptimal generation quality. Additionally, in motion generation frameworks, motion-irrelevant cues caused by noise are often entangled with features that contribute positively to generation, thereby leading to motion distortion. To address these issues, we propose Tri-Domain Causal Text-to-Motion Generation (TriC-Motion), a novel diffusion-based framework integrating spatial-temporal-frequency-domain modeling with causal intervention.…
Peer Reviews
Decision·ICLR 2026 Poster
1. The proposed method achieves remarkable improvement on the R-Precision metric. 2. The paper is well written and easy to follow. 3. It is the first use of causal learning in text-to-motion generation, a significant contribution to the research community.
My primary concern is the choice of baselines. Under the HumanML3D evaluation protocol, the evaluator is too weak: many recent methods already surpass the 'ground truth', making R-Precision on HumanML3D unreliable. Meanwhile, the FID gap to stronger methods is large (0.285 vs 0.033), so the proposed method shows no advantage on HumanML3D. Porting the approach to a MoMask baseline should not be difficult; the authors should adopt a more appropriate baseline; otherwise, it may look like trading mo
1. The first work that simultaneously integrates spatial, temporal, and frequency domains into a unified motion generation framework. 2. Introduces a causality-based counterfactual motion disentangler to expose motion-irrelevant cues and disentangle the real modeling contributions of each domain. 3. Provides ablation studies indicating the effectiveness of each domain branch and the causal-intervention design.
1. The paper uses a perceptual loss defined in the same motion–text embedding space used by the HumanML3D evaluator (the authors could clarify this point if I am wrong). Using the same feature extractor for training and inference would inflate the performance. The authors could run an ablation study that removes this loss term to show that the R-Precision gain does not come from it. 2. No visualization results. Quantitative metrics in text-to-motion are proven to be fragile and sometimes misalig
1. The proposed method demonstrates strong text–motion consistency. 2. The introduction of causal learning reduces the impact of irrelevant information on motion generation.
1. Please provide t-SNE or other visualization analyses that disentangle motion-irrelevant and motion-relevant information to demonstrate the effectiveness of the proposed method. 2. The paper’s joint temporal–frequency–spatial strategy improves text–motion alignment (R@1, R@2, R@3), but FID did not improve; therefore you cannot claim that generation quality has improved, and the conclusions stated in the abstract are not supported. 3. What advantages does using DistilBERT for word-level and sent
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
