SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech

Minchan Kim; Myeonghun Jeong; Joun Yeop Lee; Nam Soo Kim

arXiv:2410.04690 · eess.AS · October 22, 2024

TL;DR

SegINR introduces a segment-wise implicit neural representation for sequence alignment in neural TTS, eliminating the need for duration predictors and complex frame-level modeling, leading to improved speech quality and efficiency.

Contribution

It proposes a novel segment-wise INR method that models temporal dynamics and defines segment boundaries automatically, simplifying the neural TTS pipeline.

Findings

01 Outperforms conventional methods in zero-shot adaptive TTS

02 Achieves higher speech quality with lower computational costs

03 Effectively models temporal dynamics within speech segments

Abstract

We present SegINR, a novel approach to neural Text-to-Speech (TTS) that addresses sequence alignment without relying on an auxiliary duration predictor and complex autoregressive (AR) or non-autoregressive (NAR) frame-level sequence modeling. SegINR simplifies the process by converting text sequences directly into frame-level features. It leverages an optimal text encoder to extract embeddings, transforming each into a segment of frame-level features using a conditional implicit neural representation (INR). This method, named segment-wise INR (SegINR), models temporal dynamics within each segment and autonomously defines segment boundaries, reducing computational costs. We integrate SegINR into a two-stage TTS framework, using it for semantic token prediction. Our experiments in zero-shot adaptive TTS scenarios demonstrate that SegINR outperforms conventional methods in speech quality…
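The segment-wise idea in the abstract can be illustrated with a small sketch: each text embedding conditions a function of time, which is queried frame by frame until it signals the end of its segment, and the segments are then concatenated. This is only a toy illustration under stated assumptions, not the authors' implementation. The abstract does not say how boundaries are determined, so the end-of-segment logit, the randomly initialised MLP, and all names (`make_inr`, `inr_query`, `decode_segment`, `text_to_frames`, `max_frames`) are hypothetical.

```python
import numpy as np

def make_inr(embed_dim, feat_dim, hidden=32, seed=0):
    """Toy, randomly initialised MLP standing in for a trained conditional INR (assumption)."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, 0.5, (embed_dim + 1, hidden)),
        "b1": np.zeros(hidden),
        # One extra output unit acts as a hypothetical end-of-segment logit.
        "w2": rng.normal(0.0, 0.5, (hidden, feat_dim + 1)),
        "b2": np.zeros(feat_dim + 1),
    }

def inr_query(params, embedding, t):
    """Evaluate the INR at time coordinate t, conditioned on one text embedding."""
    x = np.concatenate([embedding, [t]])
    h = np.tanh(x @ params["w1"] + params["b1"])
    out = h @ params["w2"] + params["b2"]
    return out[:-1], out[-1]  # frame-level features, end-of-segment logit

def decode_segment(params, embedding, max_frames=20):
    """Query t = 0, 1, 2, ... until the end logit fires or max_frames is reached."""
    frames = []
    for t in range(max_frames):
        feat, end_logit = inr_query(params, embedding, float(t))
        frames.append(feat)
        if end_logit > 0.0:  # assumed stopping rule; each segment emits at least one frame
            break
    return np.stack(frames)

def text_to_frames(params, embeddings, max_frames=20):
    """Expand every text embedding into its own segment and concatenate the segments."""
    segments = [decode_segment(params, e, max_frames) for e in embeddings]
    lengths = [len(s) for s in segments]
    return np.concatenate(segments, axis=0), lengths

# Usage: five placeholder text embeddings of size 8, four-dimensional frame features.
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(5, 8))
params = make_inr(embed_dim=8, feat_dim=4)
frames, lengths = text_to_frames(params, embeddings)
print(frames.shape, lengths)
```

In this sketch the per-segment lengths fall out of the decoding loop itself, which is the sense in which no separate duration predictor is needed. Each segment is also decoded independently of the others, so no frame-level sequence model runs over the whole utterance.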

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis