Singing Voice Synthesis Based on a Musical Note Position-Aware Attention Mechanism
Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda

TL;DR
This paper introduces a novel seq2seq singing voice synthesis model with a musical note position-aware attention mechanism that improves naturalness and timing robustness by incorporating rhythm information from musical scores.
Contribution
It presents a new attention mechanism that explicitly considers musical note positions, enhancing the robustness and naturalness of singing voice synthesis.
Findings
Improved naturalness of synthesized singing voices.
Enhanced robustness in temporal modeling of singing voices.
Effective incorporation of musical score rhythm into the attention mechanism.
Abstract
This paper proposes a novel sequence-to-sequence (seq2seq) model with a musical note position-aware attention mechanism for singing voice synthesis (SVS). A seq2seq modeling approach that can simultaneously perform acoustic and temporal modeling is attractive. However, due to the difficulty of the temporal modeling of singing voices, many recent SVS systems with an encoder-decoder-based model still rely explicitly on duration information generated by additional modules. Although some studies perform simultaneous modeling using seq2seq models with an attention mechanism, they have insufficient robustness against temporal modeling. The proposed attention mechanism is designed to estimate the attention weights by considering the rhythm given by the musical score. Furthermore, several techniques are also introduced to improve the modeling performance of the singing voice. Experimental…
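The abstract does not give the attention formula, so the following is only a minimal sketch of the general idea of letting score rhythm shape attention weights. It is not the authors' formulation. It assumes, hypothetically, that each encoder position corresponds to one note with a known onset and offset time from the score, that the decoder knows the time of the frame it is generating, and that a simple distance penalty (the made-up `sharpness` parameter) is added to content-based scores before the softmax. All function and parameter names are illustrative.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()


def note_position_bias(frame_time, note_onsets, note_offsets, sharpness=10.0):
    """Hypothetical rhythm bias (an assumption, not the paper's method).

    The bias is 0 for a note whose score interval contains frame_time and
    becomes more negative the further the frame lies outside that interval.
    """
    before = note_onsets - frame_time   # positive if the note has not started yet
    after = frame_time - note_offsets   # positive if the note has already ended
    distance = np.maximum(0.0, np.maximum(before, after))
    return -sharpness * distance


def position_aware_attention(content_scores, frame_time, note_onsets,
                             note_offsets, sharpness=10.0):
    """Combine content-based scores with the rhythm bias, then normalize."""
    bias = note_position_bias(frame_time, note_onsets, note_offsets, sharpness)
    return softmax(content_scores + bias)


# Three notes of 0.5 s each, taken from a hypothetical score.
onsets = np.array([0.0, 0.5, 1.0])
offsets = np.array([0.5, 1.0, 1.5])

# With uninformative content scores, the frame at 0.7 s attends mostly
# to the second note, whose interval contains that time.
weights = position_aware_attention(np.zeros(3), 0.7, onsets, offsets)
print(weights)
```

In this sketch the score acts as a soft prior rather than a hard alignment: content scores can still shift attention toward a neighbouring note, which is one plausible way to tolerate the timing deviations between a score and an actual performance.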
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Sequence to Sequence
