Singing voice synthesis based on frame-level sequence-to-sequence models considering vocal timing deviation

Miku Nishihara, Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda

arXiv:2301.02262·eess.AS·February 23, 2023·1 cites

TL;DR

This paper introduces a frame-level sequence-to-sequence singing voice synthesis model that incorporates an attention mechanism to better handle vocal timing deviations, improving synchronization and sound quality.

Contribution

It proposes a frame-level, attention-based approach that mitigates alignment errors introduced by external phoneme boundary aligners in singing voice synthesis.

Findings

1. The attention mechanism effectively absorbs alignment errors.

2. The system performs well even with heuristic pseudo-phoneme boundaries.

3. Experimental results demonstrate improved synthesis quality.

Abstract

This paper proposes singing voice synthesis (SVS) based on frame-level sequence-to-sequence models considering vocal timing deviation. In SVS, it is essential to synchronize the timing of singing with temporal structures represented by scores, taking into account that there are differences between actual vocal timing and note start timing. In many SVS systems including our previous work, phoneme-level score features are converted into frame-level ones on the basis of phoneme boundaries obtained by external aligners to take into account vocal timing deviations. Therefore, the sound quality is affected by the aligner accuracy in this system. To alleviate this problem, we introduce an attention mechanism with frame-level features. In the proposed system, the attention mechanism absorbs alignment errors in phoneme boundaries. Additionally, we evaluate the system with…
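The conversion step the abstract describes, where phoneme-level score features are expanded to frame level using aligner boundaries, can be illustrated with a minimal sketch. This is not the authors' implementation: the 5 ms frame shift, the function names `expand_to_frames` and `mismatched_frames`, the phoneme sequence, and all boundary values are assumptions made up for illustration. It only shows why an error in one boundary mislabels a run of frames, which is the kind of error the proposed attention mechanism is meant to absorb.

```python
from typing import List, Sequence, Tuple

FRAME_SHIFT_MS = 5  # assumed frame shift; the abstract does not state one


def expand_to_frames(
    phoneme_features: Sequence[str],
    boundaries_ms: Sequence[Tuple[int, int]],
    frame_shift_ms: int = FRAME_SHIFT_MS,
) -> List[str]:
    """Repeat each phoneme-level feature once per frame it spans."""
    if len(phoneme_features) != len(boundaries_ms):
        raise ValueError("need one (start, end) boundary per phoneme")
    frames: List[str] = []
    for feature, (start_ms, end_ms) in zip(phoneme_features, boundaries_ms):
        n_frames = max(0, (end_ms - start_ms) // frame_shift_ms)
        frames.extend([feature] * n_frames)
    return frames


def mismatched_frames(reference: Sequence[str], estimate: Sequence[str]) -> int:
    """Count frames whose phoneme label differs between two expansions."""
    return sum(a != b for a, b in zip(reference, estimate))


# Hypothetical phoneme sequence with hand-picked "true" boundaries (ms).
phonemes = ["s", "a", "k", "u"]
true_bounds = [(0, 50), (50, 200), (200, 230), (230, 400)]
# Hypothetical aligner output: the s/a boundary is placed 20 ms late.
aligned_bounds = [(0, 70), (70, 200), (200, 230), (230, 400)]

true_frames = expand_to_frames(phonemes, true_bounds)
aligned_frames = expand_to_frames(phonemes, aligned_bounds)
print(len(true_frames), mismatched_frames(true_frames, aligned_frames))
# prints "80 4": a 20 ms boundary error mislabels 4 of 80 frames
```

In a system that relies on hard expansion alone, those mislabeled frames are fed to the acoustic model as if they were correct. An attention mechanism over frame-level features could, in principle, let each output frame look at neighboring input frames rather than only its own.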

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing