Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers

Sohan Anisetty; James Hays

arXiv:2409.01591·cs.CV·September 4, 2024

TL;DR

This paper introduces a multimodal motion synthesis framework that conditions on text and audio simultaneously. It uses VQVAEs to discretize motion, bidirectional masked token prediction, spatial attention, and a token critic to generate coherent, natural whole-body motion.

Contribution

It presents a new framework integrating multimodal inputs with transformers and VQVAEs, improving motion coherence and processing efficiency over prior methods.

Findings

01. Enhanced motion coherence and naturalness

02. Improved processing efficiency

03. Expanded multimodal motion synthesis capabilities

Abstract

Our research presents a novel motion generation framework designed to produce whole-body motion sequences conditioned on multiple modalities simultaneously, specifically text and audio inputs. Leveraging Vector Quantized Variational Autoencoders (VQVAEs) for motion discretization and a bidirectional Masked Language Modeling (MLM) strategy for efficient token prediction, our approach achieves improved processing efficiency and coherence in the generated motions. By integrating spatial attention mechanisms and a token critic, we ensure consistency and naturalness in the generated motions. This framework expands the possibilities of motion generation, addressing the limitations of existing approaches and opening avenues for multimodal motion synthesis.
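To make the two core ideas in the abstract concrete, here is a minimal sketch of (1) discretizing motion features by nearest-codebook lookup, as a VQVAE quantizer does, and (2) masking a random subset of the resulting tokens, as in MLM-style training. Everything specific here is an assumption for illustration: the toy 2-D codebook, the `MASK` id, the `quantize` and `mask_tokens` helpers, and the 50% mask ratio are hypothetical and are not taken from the paper. The learned encoder, the audio/text conditioning, and the transformer that predicts the masked tokens are not shown.

```python
import random

MASK = -1  # hypothetical id standing in for a [MASK] token


def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def mask_tokens(tokens, ratio, rng):
    """Replace a random subset of tokens with MASK.

    Returns the masked sequence and the sorted masked positions; a model
    would be trained to predict the original tokens at those positions.
    """
    n_mask = max(1, round(ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK
    return masked, positions


# Toy data: three codebook entries and four per-frame feature vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.1, 0.8], [0.2, 0.1]]

tokens = quantize(frames, codebook)  # discrete motion tokens
masked, positions = mask_tokens(tokens, ratio=0.5, rng=random.Random(0))
```

In a bidirectional setup like the one the abstract describes, the predictor can attend to unmasked tokens on both sides of each masked position, which is what allows several tokens to be filled in per step rather than one at a time.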

Taxonomy

Topics: Human Motion and Animation

Methods: Softmax · Attention Is All You Need