Audeo: Audio Generation for a Silent Performance Video
Kun Su, Xiulong Liu, Eli Shlizerman

TL;DR
Audeo is a system that converts silent piano performance videos into plausible, high-quality music by translating visual cues into symbolic representations and synthesizing audio, demonstrating the feasibility of visual-to-audio transformation.
Contribution
This work introduces a complete pipeline for generating music from silent performance videos, combining visual-to-symbolic translation and audio synthesis, which was not previously demonstrated.
Findings
Generated music has reasonable audio quality.
Music can be recognized with high precision by music identification software.
The system works effectively on 'in the wild' videos.
Abstract
We present a novel system that takes as input video frames of a musician playing the piano and generates the music for that video. Generating music from visual cues is a challenging problem, and it is not clear whether it is an attainable goal at all. Our main aim in this work is to explore the plausibility of such a transformation and to identify cues and components able to carry the association of sounds with visual events. To achieve the transformation we built a full pipeline named 'Audeo' containing three components. We first translate the video frames of the keyboard and the musician's hand movements into a raw mechanical symbolic musical representation, the Piano-Roll (Roll), which represents the keys pressed at each time step, one time step per video frame. We then adapt the Roll to be amenable to audio synthesis by including temporal correlations. This step turns out to be critical…
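To make the Piano-Roll representation concrete, here is a minimal sketch of how a per-frame binary Roll could be collapsed into note events (pitch, onset, duration), a form that a synthesizer could consume. This is an illustration only, not the authors' method. The function name `roll_to_notes`, the 25 fps frame rate, and the mapping of key index 0 to MIDI pitch 21 (A0 on a standard 88-key piano) are all assumptions introduced here.

```python
from typing import List, Tuple

# Assumptions for illustration only (not taken from the paper):
FPS = 25.0         # hypothetical video frame rate
LOWEST_MIDI = 21   # MIDI pitch of A0, the lowest key on a standard 88-key piano

Note = Tuple[int, float, float]  # (midi_pitch, onset_seconds, duration_seconds)


def roll_to_notes(roll: List[List[int]], fps: float = FPS) -> List[Note]:
    """Convert a per-frame binary piano roll into note events.

    roll[t][k] is 1 if key k is pressed in frame t, else 0.
    Consecutive pressed frames of the same key are merged into one note.
    """
    if not roll:
        return []
    num_keys = len(roll[0])
    notes: List[Note] = []
    for key in range(num_keys):
        start = None  # frame index where the current press began, if any
        for t, frame in enumerate(roll):
            pressed = frame[key] == 1
            if pressed and start is None:
                start = t
            elif not pressed and start is not None:
                notes.append((LOWEST_MIDI + key, start / fps, (t - start) / fps))
                start = None
        if start is not None:
            # Key is still held when the video ends: close the note at the last frame.
            notes.append((LOWEST_MIDI + key, start / fps, (len(roll) - start) / fps))
    # Order notes by onset time, then by pitch.
    notes.sort(key=lambda n: (n[1], n[0]))
    return notes


# Toy example: 5 frames, 3 keys (a real Roll would have 88 columns).
toy_roll = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
]
toy_notes = roll_to_notes(toy_roll, fps=1.0)
```

A raw frame-by-frame Roll of this kind only records which keys are down in each frame. The abstract's second stage, adapting the Roll with temporal correlations before synthesis, addresses what such a purely mechanical representation lacks.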
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Video Analysis and Summarization
