# Embracing Aleatoric Uncertainty: Generating Diverse 3D Human Motion

**Authors:** Zheng Qin, Yabing Wang, Minghui Yang, Sanping Zhou, Ming Yang, Le Wang

arXiv: 2508.20604 · 2025-08-29

## TL;DR

This paper introduces Diverse-T2M, a text-to-3D human motion generation method that injects uncertainty into the generation process to produce diverse, semantically consistent motions. It improves diversity while maintaining state-of-the-art text consistency.

## Contribution

It presents a transformer-based approach that explicitly models uncertainty using noise signals and a stochastic latent space sampler, improving diversity in text-to-motion generation.

## Key findings

- Significantly improves motion diversity on the HumanML3D and KIT-ML benchmarks.
- Maintains state-of-the-art text-motion semantic consistency.
- Outperforms existing methods in diversity metrics.
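
To make "diversity" concrete, the sketch below scores a set of motions generated from the same text by their average pairwise distance. This is an illustrative assumption, not the paper's evaluation code: the function name `mean_pairwise_distance` is hypothetical, the rows of `features` are assumed to be per-motion feature vectors from some motion encoder, and common benchmark protocols sample random pairs rather than enumerating all pairs as done here.

```python
import numpy as np

def mean_pairwise_distance(features):
    """Average Euclidean distance over all distinct pairs of rows.

    `features` has shape (num_motions, feature_dim); each row is assumed
    to describe one motion generated from the same text prompt.
    """
    n = features.shape[0]
    # Broadcast to an (n, n, feature_dim) array of pairwise differences.
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Keep each unordered pair once (strict upper triangle).
    upper = np.triu_indices(n, k=1)
    return float(dists[upper].mean())

# Identical outputs score zero; outputs that differ score higher.
collapsed = np.ones((5, 3))
spread = np.array([[0.0, 0.0], [3.0, 4.0]])
collapsed_score = mean_pairwise_distance(collapsed)
spread_score = mean_pairwise_distance(spread)
```

A deterministic one-to-one text-to-motion mapping would behave like `collapsed` under such a score, which is the failure mode the paper targets.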

## Abstract

Generating 3D human motions from text is a challenging yet valuable task. The key aspects of this task are ensuring text-motion consistency and achieving generation diversity. Although recent advancements have enabled the generation of precise and high-quality human motions from text, achieving diversity in the generated motions remains a significant challenge. In this paper, we aim to overcome the above challenge by designing a simple yet effective text-to-motion generation method, i.e., Diverse-T2M. Our method introduces uncertainty into the generation process, enabling the generation of highly diverse motions while preserving the semantic consistency of the text. Specifically, we propose a novel perspective that utilizes noise signals as carriers of diversity information in transformer-based methods, facilitating an explicit modeling of uncertainty. Moreover, we construct a latent space where text is projected into a continuous representation, instead of a rigid one-to-one mapping, and integrate a latent space sampler to introduce stochastic sampling into the generation process, thereby enhancing the diversity and uncertainty of the outputs. Our results on text-to-motion generation benchmark datasets (HumanML3D and KIT-ML) demonstrate that our method significantly enhances diversity while maintaining state-of-the-art performance in text consistency.
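
The abstract does not spell out how the latent space sampler works, so the following is only a minimal sketch of one common way to replace a one-to-one text mapping with a stochastic one: project the text embedding to the mean and log-variance of a Gaussian and sample with a noise signal. The Gaussian form, the function `sample_text_latent`, and the weights `w_mu` and `w_logvar` are all assumptions made for illustration, not the authors' architecture.

```python
import numpy as np

def sample_text_latent(text_embedding, w_mu, w_logvar, rng):
    """Map a text embedding to a Gaussian and draw one latent sample.

    Hypothetical illustration: the linear projections and the Gaussian
    parameterization are assumptions, not taken from the paper.
    """
    mu = text_embedding @ w_mu           # mean of the latent distribution
    logvar = text_embedding @ w_logvar   # log-variance of the latent distribution
    eps = rng.standard_normal(mu.shape)  # noise signal that carries diversity
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
text_dim, latent_dim = 8, 4

# Random stand-ins for learned projection weights and an encoded prompt.
w_mu = 0.1 * rng.standard_normal((text_dim, latent_dim))
w_logvar = 0.1 * rng.standard_normal((text_dim, latent_dim))
text_embedding = rng.standard_normal(text_dim)

# The same text yields a different latent on each call.
z_a = sample_text_latent(text_embedding, w_mu, w_logvar, rng)
z_b = sample_text_latent(text_embedding, w_mu, w_logvar, rng)
```

In a setup like this, each sampled latent would condition the motion generator, so repeated calls with the same prompt can produce different motions that share the same text-derived mean.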

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20604/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20604/full.md

## References

56 references — full list in the complete paper: https://tomesphere.com/paper/2508.20604/full.md

---
Source: https://tomesphere.com/paper/2508.20604