Annotation-Free MIDI-to-Audio Synthesis via Concatenative Synthesis and Generative Refinement
Osamu Take, Taketo Akama

TL;DR
CoSaRef is a MIDI-to-audio synthesis method that needs no MIDI-annotated training data: it renders the MIDI input with concatenative synthesis, then refines the result with a diffusion-based model, producing diverse, realistic instrumental tracks with fine-grained control.
Contribution
CoSaRef introduces a MIDI-to-audio synthesis approach that eliminates the need for MIDI-audio paired training data, broadening the diversity of timbres and expression styles and improving control over the generated audio.
Findings
Outperforms state-of-the-art MIDI-supervised methods in quality and control.
Enables detailed timbre and expression control via audio samples and MIDI design.
Produces realistic, expressive tracks with diverse instrument timbres.
Abstract
Recent MIDI-to-audio synthesis methods using deep neural networks have successfully generated high-quality, expressive instrumental tracks. However, these methods require MIDI annotations for supervised training, limiting the diversity of instrument timbres and expression styles in the output. We propose CoSaRef, a MIDI-to-audio synthesis method that does not require MIDI-audio paired datasets. CoSaRef first generates a synthetic audio track using concatenative synthesis based on MIDI input, then refines it with a diffusion-based deep generative model trained on datasets without MIDI annotations. This approach improves the diversity of timbres and expression styles. Additionally, it allows detailed control over timbres and expression through audio sample selection and extra MIDI design, similar to traditional functions in digital audio workstations. Experiments showed that CoSaRef could…
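The first stage described in the abstract, concatenative synthesis from MIDI input, can be illustrated with a minimal sketch. This is not the authors' implementation: the note tuple format, the `sine_one_shot` stand-in for recorded audio samples, the function names, and the 16 kHz sample rate are all assumptions made for illustration. It simply places one sample per note at its onset, scaled by velocity, to produce the rough track that a refinement model would then process.

```python
import math

SAMPLE_RATE = 16_000  # Hz; an assumed rate for this sketch


def sine_one_shot(pitch, duration_s=0.5):
    """Hypothetical stand-in for a recorded one-shot sample: a decaying sine."""
    freq = 440.0 * 2.0 ** ((pitch - 69) / 12.0)  # standard MIDI-to-Hz conversion
    n = int(duration_s * SAMPLE_RATE)
    return [
        math.exp(-4.0 * i / n) * math.sin(2.0 * math.pi * freq * i / SAMPLE_RATE)
        for i in range(n)
    ]


def concatenative_synthesis(notes, sample_bank, total_s):
    """Overlap-add one sample per note at its onset.

    notes: list of (onset_s, duration_s, pitch, velocity) tuples (assumed format).
    sample_bank: dict mapping MIDI pitch -> list of float samples.
    """
    track = [0.0] * int(total_s * SAMPLE_RATE)
    for onset_s, duration_s, pitch, velocity in notes:
        sample = sample_bank[pitch]
        start = int(onset_s * SAMPLE_RATE)
        # Truncate the sample to the note length and to the end of the track.
        length = min(len(sample), int(duration_s * SAMPLE_RATE), len(track) - start)
        gain = velocity / 127.0  # map MIDI velocity to linear gain
        for i in range(max(length, 0)):
            track[start + i] += gain * sample[i]
    return track


# Three notes (C4, E4, G4) played back to back.
notes = [(0.0, 0.5, 60, 100), (0.5, 0.5, 64, 80), (1.0, 0.5, 67, 127)]
bank = {p: sine_one_shot(p) for p in {60, 64, 67}}
rough_track = concatenative_synthesis(notes, bank, total_s=1.5)
```

Swapping the entries of `bank` for different recordings changes the timbre, and editing `notes` changes the performance, which mirrors the kind of control through sample selection and MIDI design that the abstract mentions. The diffusion-based refinement stage is deliberately left out of this sketch.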
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
