Singing Synthesis: with a little help from my attention
Orazio Angelini, Alexis Moinet, Kayoko Yanagisawa, Thomas Drugman

TL;DR
UTACO is an attention-based singing synthesis model that improves naturalness over previous neural singing synthesis models without requiring durations or pitch patterns as inputs, showing that sequence-to-sequence models can be applied effectively to singing synthesis.
Contribution
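This work applies attention-based sequence-to-sequence models to singing synthesis. The resulting system needs considerably less explicit modelling of voice features (F0 patterns, vibratos, and note and phoneme durations) than previous models, while achieving higher naturalness than previous neural singing synthesis models.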
Findings
Improves naturalness over previous neural singing synthesis models
Learns to insert vibrato autonomously according to the musical context
Requires no durations or pitch patterns as inputs
Abstract
We present UTACO, a singing synthesis model based on an attention-based sequence-to-sequence mechanism and a vocoder based on dilated causal convolutions. These two classes of models have significantly affected the field of text-to-speech, but have never been thoroughly applied to the task of singing synthesis. UTACO demonstrates that attention can be successfully applied to the singing synthesis field and improves naturalness over the state of the art. The system requires considerably less explicit modelling of voice features such as F0 patterns, vibratos, and note and phoneme durations, than previous models in the literature. Despite this, it shows a strong improvement in naturalness with respect to previous neural singing synthesis models. The model does not require any durations or pitch patterns as inputs, and learns to insert vibrato autonomously according to the musical context.…
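As a rough illustration of why an attention mechanism can remove the need for explicit durations, the sketch below computes one attention step over a short sequence of encoded score elements. It is a generic dot-product attention written for illustration only: the abstract does not specify which attention variant UTACO uses, and the names `attend`, `encoder_states`, and `query`, as well as the toy vectors, are assumptions rather than anything taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """One attention step (generic dot-product scoring, an assumption).

    query: decoder state for the current output frame.
    encoder_states: one vector per input element (e.g. an encoded note or phoneme).
    Returns the attention weights and the weighted-sum context vector.
    """
    # Score each input element against the current decoder state.
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: weighted average of the encoder states.
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

# Hypothetical toy encodings of three input elements.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical decoder state for one output frame.
query = [0.0, 5.0]

weights, context = attend(query, encoder_states)
```

In a model of this kind, the decoder repeats this step for every output frame, so how long the weights stay concentrated on a given input element plays the role that an explicit duration input would otherwise play.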
