CC-G2PnP: Streaming Grapheme-to-Phoneme and prosody with Conformer-CTC for unsegmented languages

Yuma Shirahata; Ryuichi Yamamoto

arXiv:2602.17157 · eess.AS · February 20, 2026

Open Access

TL;DR

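CC-G2PnP is a streaming grapheme-to-phoneme and prosody (G2PnP) model that sits between a large language model and a text-to-speech system. It uses a Conformer-CTC architecture to predict phonemic and prosodic labels chunk by chunk, and it does not require explicit word boundaries, so it applies to unsegmented languages.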

Contribution

It introduces a novel streaming G2PnP model based on Conformer-CTC that handles unsegmented languages without relying on explicit word boundaries.
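
To illustrate why a CTC output layer removes the need for explicit word boundaries, here is a minimal sketch of standard greedy CTC decoding: the model emits one label (or a blank) per frame, and the decoder merges consecutive repeats and then drops blanks. The function name `ctc_greedy_collapse`, the `<blank>` symbol, and the toy labels are assumptions made for illustration; the source does not say which decoding strategy or label inventory CC-G2PnP uses.

```python
from typing import List, Optional, Sequence

BLANK = "<blank>"  # hypothetical blank symbol, not taken from the paper


def ctc_greedy_collapse(frame_labels: Sequence[str], blank: str = BLANK) -> List[str]:
    """Collapse per-frame CTC labels: merge consecutive repeats, then drop blanks."""
    output: List[str] = []
    previous: Optional[str] = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output


# Toy per-frame predictions; a blank between the two "o" frames keeps both.
frames = ["k", "k", BLANK, "o", "o", BLANK, "o"]
print(ctc_greedy_collapse(frames))
```

Because the alignment between frames and output labels is learned through the CTC objective, nothing in this step needs to know where words begin or end.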

Findings

01 Outperforms the baseline streaming G2PnP model in phoneme and prosody prediction accuracy

02 Applicable to unsegmented languages such as Japanese

03 Enables stable streaming inference with a guaranteed minimal look-ahead for each input token

Abstract

We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model that connects a large language model and a text-to-speech system in a streaming manner. CC-G2PnP is based on the Conformer-CTC architecture. Specifically, the input grapheme tokens are processed chunk by chunk, which enables streaming inference of phonemic and prosodic (PnP) labels. By guaranteeing a minimum look-ahead size for each input token, the proposed model can take future context into account for every token, which leads to stable PnP label prediction. Unlike previous streaming methods that depend on explicit word boundaries, the CTC decoder in CC-G2PnP learns the alignment between graphemes and phonemes during training, making it applicable to unsegmented languages. Experiments on a Japanese dataset, which has no explicit word boundaries, show that CC-G2PnP significantly outperforms the baseline streaming G2PnP model in…
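
The chunk-by-chunk processing with a guaranteed look-ahead can be pictured with a small sketch. It splits a token stream into fixed-size chunks and pairs each chunk with the future tokens that must have arrived before the chunk is processed. The function `stream_chunks` and the parameters `chunk_size` and `min_lookahead` are hypothetical names chosen for this illustration; the abstract does not give the actual chunking scheme or sizes, so treat this as one plausible reading, not the authors' implementation.

```python
from typing import Iterator, List, Tuple


def stream_chunks(
    tokens: List[str], chunk_size: int, min_lookahead: int
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (chunk, lookahead) pairs over a token sequence.

    Every token in a chunk can see at least `min_lookahead` future tokens,
    except near the end of the input, where fewer tokens remain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if min_lookahead < 0:
        raise ValueError("min_lookahead must be non-negative")
    for start in range(0, len(tokens), chunk_size):
        end = min(start + chunk_size, len(tokens))
        # Future context visible to the last token of this chunk.
        lookahead_end = min(end + min_lookahead, len(tokens))
        yield tokens[start:end], tokens[end:lookahead_end]


# Unsegmented input: characters arrive with no word boundaries.
graphemes = list("こんにちは世界")
for chunk, lookahead in stream_chunks(graphemes, chunk_size=3, min_lookahead=2):
    print("".join(chunk), "| look-ahead:", "".join(lookahead))
```

In a streaming setting, a larger `min_lookahead` gives each token more future context at the cost of added latency, since a chunk cannot be processed until its look-ahead tokens have been received.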

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling