TriC-Motion: Tri-Domain Causal Modeling Grounded Text-to-Motion Generation

Yiyang Cao; Yunze Deng; Ziyu Lin; Bin Feng; Xinggang Wang; Wenyu Liu; Dandan Zheng; Jingdong Chen

arXiv:2602.08462·cs.CV·February 10, 2026

TL;DR

TriC-Motion introduces a unified causal diffusion framework that models spatial, temporal, and frequency domains simultaneously for improved text-to-motion generation quality.

Contribution

It proposes a novel tri-domain causal modeling framework with domain-specific modules and a fusion mechanism, enhancing motion generation fidelity and coherence.

Findings

01 Achieves R@1 of 0.612 on the HumanML3D dataset

02 Outperforms state-of-the-art methods in motion quality and alignment

03 Effectively disentangles motion-irrelevant noise
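
For readers unfamiliar with the R@1 figure cited in the findings, the following is a minimal sketch of how R-Precision (R@1, R@2, R@3) is commonly computed in text-to-motion evaluation. It assumes that text and motion embeddings come from a pretrained evaluator, that row i of each array forms a matched pair, and that candidates are ranked by Euclidean distance. The function name `r_precision` and the use of the whole batch as the candidate pool are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def r_precision(text_emb, motion_emb, top_k=3):
    """Return [R@1, ..., R@top_k] for matched text/motion embeddings.

    Assumption: row i of text_emb is the description of row i of motion_emb,
    and every motion in the batch serves as a retrieval candidate.
    """
    # Pairwise Euclidean distances, shape (n_texts, n_motions).
    diff = text_emb[:, None, :] - motion_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Distance from each text to its own matched motion.
    matched = np.diag(dist)

    # Zero-based rank of the matched motion: how many candidates are strictly closer.
    ranks = (dist < matched[:, None]).sum(axis=1)

    # R@k is the fraction of texts whose matched motion ranks within the top k.
    return [float((ranks < k).mean()) for k in range(1, top_k + 1)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    motions = rng.normal(size=(32, 16))
    # Hypothetical text embeddings: noisy copies of the matched motion embeddings.
    texts = motions + 0.1 * rng.normal(size=motions.shape)
    print(r_precision(texts, motions))
```

Because the score depends entirely on the evaluator's embedding space, a weak evaluator can make R-Precision unreliable, which is the concern raised in the peer reviews.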

Abstract

Text-to-motion generation, a rapidly evolving field in computer vision, aims to produce realistic and text-aligned motion sequences. Current methods primarily focus on spatial-temporal modeling or independent frequency domain analysis, lacking a unified framework for joint optimization across spatial, temporal, and frequency domains. This limitation hinders the model's ability to leverage information from all domains simultaneously, leading to suboptimal generation quality. Additionally, in motion generation frameworks, motion-irrelevant cues caused by noise are often entangled with features that contribute positively to generation, thereby leading to motion distortion. To address these issues, we propose Tri-Domain Causal Text-to-Motion Generation (TriC-Motion), a novel diffusion-based framework integrating spatial-temporal-frequency-domain modeling with causal intervention.…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The proposed method achieves remarkable improvement on the R-Precision metric.
2. The paper is well written and easy for readers to follow.
3. It is the first use of causal learning in text-to-motion generation, a significant contribution to the research community.

Weaknesses

My primary concern is the choice of baselines. Under the HumanML3D evaluation protocol, the evaluator is too weak: many recent methods already surpass the 'ground truth', making R-Precision on HumanML3D unreliable. Meanwhile, the FID gap to stronger methods is large (0.285 vs 0.033), so the proposed method shows no advantage on HumanML3D. Porting the approach to a MoMask baseline should not be difficult; the authors should adopt a more appropriate baseline; otherwise, it may look like trading mo

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The first work that simultaneously integrates spatial, temporal, and frequency domains into a unified motion generation framework.
2. Introduces a causality-based counterfactual motion disentangler to expose motion-irrelevant cues and disentangle the real modeling contributions of each domain.
3. Provides ablation studies indicating the effectiveness of each domain branch and the causal-intervention design.

Weaknesses

1. The paper uses a perceptual loss defined in the same motion–text embedding space used by the HumanML3D evaluator (the authors could clarify this point if I am wrong). Using the same feature extractor for training and evaluation would inflate the performance. The authors could run an ablation study that removes this loss term to show that the R-Precision gain does not come from it.
2. No visualization results. Quantitative metrics in text-to-motion are proven to be fragile and sometimes misalig

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. The proposed method demonstrates strong text–motion consistency.
2. The introduction of causal learning reduces the impact of irrelevant information on motion generation.

Weaknesses

1. Please provide t-SNE or other visualization analyses that separate motion-irrelevant from motion-relevant information to demonstrate the effectiveness of the proposed method.
2. The paper's joint temporal–frequency–spatial strategy improves text–motion alignment (R@1, R@2, R@3), but FID did not improve; therefore the paper cannot claim that generation quality has improved, and the conclusions stated in the abstract are not supported.
3. What advantages does using DistilBERT for word-level and sent

Taxonomy

Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications