Let's Play Music: Audio-driven Performance Video Generation
Hao Zhu, Yi Li, Feixia Zhu, Aihua Zheng, Ran He

TL;DR
This paper introduces a novel task of generating synchronized performance videos from music audio, employing a multi-stage framework that combines global appearance, keypoint heatmaps, and structured temporal modeling for realistic results.
Contribution
The paper presents a new approach for audio-driven performance video synthesis, integrating keypoint heatmap transformation and a structured temporal UNet for improved spatial and temporal consistency.
Findings
Generated videos are highly synchronized with music.
The framework produces realistic and temporally consistent performance videos.
Experimental results demonstrate the effectiveness of the proposed method.
Abstract
We propose a new task named Audio-driven Performance Video Generation (APVG), which aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip. It is a challenging task to generate the high-dimensional temporal consistent videos from low-dimensional audio modality. In this paper, we propose a multi-staged framework to achieve this new task to generate realistic and synchronized performance video from given music. Firstly, we provide both global appearance and local spatial information by generating the coarse videos and keypoints of body and hands from a given music respectively. Then, we propose to transform the generated keypoints to heatmap via a differentiable space transformer, since the heatmap offers more spatial information but is harder to generate directly from audio. Finally, we propose a Structured Temporal UNet (STU) to extract…
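The keypoint-to-heatmap step mentioned in the abstract can be illustrated with a small sketch. The abstract does not describe how the differentiable space transformer works internally, so the code below assumes a common stand-in: rendering an isotropic Gaussian around each keypoint, which is smooth in the keypoint coordinates and therefore differentiable. The function name `keypoints_to_heatmaps`, the `sigma` parameter, and the grid size are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, height, width, sigma=2.0):
    """Render K keypoints, given as (x, y) pixel coordinates, into K Gaussian heatmaps.

    Assumption: a Gaussian rendering is used as a stand-in for the paper's
    differentiable transformer, whose details are not given in the abstract.
    Returns an array of shape (K, height, width) with values in (0, 1].
    """
    keypoints = np.asarray(keypoints, dtype=np.float64)
    ys = np.arange(height, dtype=np.float64)[None, :, None]  # shape (1, H, 1)
    xs = np.arange(width, dtype=np.float64)[None, None, :]   # shape (1, 1, W)
    kx = keypoints[:, 0][:, None, None]                      # shape (K, 1, 1)
    ky = keypoints[:, 1][:, None, None]                      # shape (K, 1, 1)
    # Squared distance from every pixel to every keypoint, broadcast to (K, H, W).
    sq_dist = (xs - kx) ** 2 + (ys - ky) ** 2
    # Smooth in kx and ky, so gradients can flow back to the keypoint coordinates.
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Two hypothetical keypoints on a 16x16 grid.
heatmaps = keypoints_to_heatmaps([[10.0, 5.0], [3.5, 12.0]], height=16, width=16)
```

Each heatmap peaks at the pixel nearest its keypoint and decays with distance, which gives a downstream network a dense spatial signal instead of a pair of coordinates. In an autograd framework the same expression could be written with that framework's array operations so the loss on the heatmap propagates to the keypoint generator.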
Taxonomy
Topics: Music and Audio Processing · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization
Methods: Heatmap
