MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation

Siyuan Wang; Jiawei Liu; Wei Wang; Yeying Jin; Jinsong Du; Zhi Han

arXiv:2505.23120 · cs.CV · March 16, 2026

TL;DR

The paper introduces MMGT, a two-stage network that generates realistic co-speech gesture videos by combining audio, motion masks, and pose information, improving motion realism and detail accuracy.

Contribution

The paper proposes a two-stage framework that derives motion masks and pose videos from audio to improve co-speech gesture video synthesis, without requiring extra prior inputs.

Findings

01 Improved video quality and lip-sync accuracy.

02 Enhanced large gesture motion capture.

03 Better region-specific detail control.

Abstract

Co-Speech Gesture Video Generation aims to generate vivid speech videos from audio-driven still images, which is challenging due to the diversity of body parts in terms of motion amplitude, audio relevance, and detailed features. Relying solely on audio as the control signal often fails to capture large gesture movements in videos, resulting in more noticeable artifacts and distortions. Existing approaches typically address this issue by adding extra prior inputs, but this can limit the practical application of the task. Specifically, we propose a Motion Mask-Guided Two-Stage Network (MMGT) that uses audio, along with motion masks and pose videos generated from the audio signal, to jointly generate synchronized speech gesture videos. In the first stage, the Spatial Mask-Guided Audio2Pose Generation (SMGA) Network generates high-quality pose videos and motion masks from audio,…
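
The two-stage data flow described in the abstract can be sketched roughly as follows. This is a toy illustration only, not the authors' implementation: the names `Stage1Output`, `audio_to_pose_and_masks`, and `generate_video` are hypothetical, frames are plain nested lists, and the learned networks are replaced with trivial arithmetic so that only the interfaces between the stages are shown.

```python
from dataclasses import dataclass
from typing import List

# Toy stand-in for an image: a height x width grid of floats.
Frame = List[List[float]]


@dataclass
class Stage1Output:
    """Hypothetical container for what stage one hands to stage two."""
    pose_frames: List[Frame]
    motion_masks: List[Frame]


def audio_to_pose_and_masks(audio: List[float], height: int, width: int,
                            threshold: float = 0.5) -> Stage1Output:
    """Stage-one placeholder: one pose frame and one motion mask per audio sample.

    Assumption: a real model would be a learned network; here the pose frame is
    filled with the clipped audio magnitude, and the mask is 1.0 everywhere
    when that magnitude exceeds `threshold`, else 0.0.
    """
    pose_frames: List[Frame] = []
    motion_masks: List[Frame] = []
    for sample in audio:
        level = min(abs(sample), 1.0)
        pose_frames.append([[level] * width for _ in range(height)])
        mask_value = 1.0 if level > threshold else 0.0
        motion_masks.append([[mask_value] * width for _ in range(height)])
    return Stage1Output(pose_frames, motion_masks)


def generate_video(reference: Frame, audio: List[float],
                   stage1: Stage1Output) -> List[Frame]:
    """Stage-two placeholder: combine still image, audio, poses, and masks.

    Assumption: the toy version uses the audio only to align frame count and
    blends per pixel: masked regions follow the pose frame, unmasked regions
    keep the reference image.
    """
    video: List[Frame] = []
    for _sample, pose, mask in zip(audio, stage1.pose_frames, stage1.motion_masks):
        frame = [
            [(1.0 - m) * r + m * p for r, p, m in zip(ref_row, pose_row, mask_row)]
            for ref_row, pose_row, mask_row in zip(reference, pose, mask)
        ]
        video.append(frame)
    return video


# Usage: a 2x2 still image and two audio samples (one quiet, one loud).
reference = [[0.2, 0.2], [0.2, 0.2]]
audio = [0.1, 0.9]
stage1 = audio_to_pose_and_masks(audio, height=2, width=2)
video = generate_video(reference, audio, stage1)
```

In this sketch the quiet sample leaves the reference image unchanged, while the loud sample triggers the mask and the frame follows the pose signal. The point is the interface: stage two receives the audio together with both outputs of stage one, rather than audio alone.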

Code & Models

Repositories

sia-ide/mmgt (Official)

Taxonomy

Methods: Softmax · Attention Is All You Need · Diffusion