Singing Voice Synthesis Based on a Musical Note Position-Aware Attention Mechanism

Yukiya Hono; Kei Hashimoto; Yoshihiko Nankaku; Keiichi Tokuda

arXiv:2212.13703 · eess.AS · March 16, 2023 · 1 cites

Open Access

TL;DR

This paper introduces a novel seq2seq singing voice synthesis model with a musical note position-aware attention mechanism that improves naturalness and timing robustness by incorporating rhythm information from musical scores.

Contribution

It presents a new attention mechanism that explicitly considers musical note positions, enhancing the robustness and naturalness of singing voice synthesis.

Findings

01. Improved naturalness of synthesized singing voices.

02. Enhanced robustness in temporal modeling of singing voices.

03. Effective incorporation of musical score rhythm into the attention mechanism.

Abstract

This paper proposes a novel sequence-to-sequence (seq2seq) model with a musical note position-aware attention mechanism for singing voice synthesis (SVS). A seq2seq modeling approach that can simultaneously perform acoustic and temporal modeling is attractive. However, due to the difficulty of the temporal modeling of singing voices, many recent SVS systems with an encoder-decoder-based model still rely explicitly on duration information generated by additional modules. Although some studies perform simultaneous modeling using seq2seq models with an attention mechanism, their temporal modeling is not sufficiently robust. The proposed attention mechanism is designed to estimate the attention weights by considering the rhythm given by the musical score. Furthermore, several techniques are also introduced to improve the modeling performance of the singing voice. Experimental…

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Sequence to Sequence