TL;DR
This paper introduces SATO, a framework that makes text-to-motion models more stable by addressing the output inconsistency caused by unstable attention patterns in the text encoder, while maintaining high accuracy.
Contribution
SATO provides a formal framework with modules for stable attention and prediction, enhancing robustness against input perturbations in text-to-motion models.
Findings
SATO significantly improves stability against synonym perturbations.
SATO maintains high accuracy comparable to existing models.
The framework effectively reduces output inconsistency caused by attention instability.
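To make "attention stability under synonym perturbation" concrete, here is a minimal sketch of one way such stability could be measured: the overlap between the top-k attended token positions for an original prompt and a synonym-substituted prompt. This is an illustration only, not SATO's actual formulation, which this summary does not specify. The function names, the choice of top-k overlap as the metric, and the attention values are all hypothetical.

```python
def top_k_indices(weights, k):
    """Return the set of positions holding the k largest attention weights."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return set(order[:k])


def top_k_overlap(attn_a, attn_b, k):
    """Fraction of top-k positions shared by two attention vectors (0.0 to 1.0)."""
    if len(attn_a) != len(attn_b):
        raise ValueError("attention vectors must have the same length")
    if not 1 <= k <= len(attn_a):
        raise ValueError("k must be between 1 and the vector length")
    shared = top_k_indices(attn_a, k) & top_k_indices(attn_b, k)
    return len(shared) / k


# Hypothetical per-token attention weights (made up for illustration) for
# "a person walks forward" and a synonym variant "a human walks forward".
original = [0.05, 0.15, 0.60, 0.20]
stable_variant = [0.06, 0.14, 0.58, 0.22]    # attention stays on the same tokens
unstable_variant = [0.50, 0.30, 0.05, 0.15]  # attention shifts to other tokens

stable_score = top_k_overlap(original, stable_variant, k=2)
unstable_score = top_k_overlap(original, unstable_variant, k=2)
```

Under this assumed metric, a score near 1.0 means the encoder attends to the same tokens after the synonym swap, and a score near 0.0 means the attention pattern has shifted, which is the kind of instability the paper links to inconsistent motion outputs.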
Abstract
Is the Text to Motion model robust? Recent advancements in Text to Motion models primarily stem from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with the text-to-motion model: its predictions often exhibit inconsistent outputs, resulting in vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we undertake an analysis to elucidate the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. Consequently, we introduce a formal framework aimed at addressing this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists…
