Synthetic Data Generation for Phrase Break Prediction with Large Language Model

Hoyeon Lee; Sejung Son; Ye-Eun Kang; Jong-Hwan Kim

arXiv:2507.18044 · cs.CL · July 25, 2025

Open Access

TL;DR

This paper investigates using large language models to generate synthetic phrase break annotations, aiming to reduce manual effort and improve data quality in speech prosody prediction across multiple languages.

Contribution

It introduces an approach that uses LLMs to generate synthetic training data for phrase break prediction and demonstrates its effectiveness relative to traditional human annotations.

Findings

01 LLM-generated data reduces manual annotation effort.

02 Synthetic data improves phrase break prediction accuracy.

03 Method is effective across multiple languages.

Abstract

Current approaches to phrase break prediction address crucial prosodic aspects of text-to-speech systems but rely heavily on vast human annotations from audio or text, incurring significant manual effort and cost. Inherent variability in the speech domain, driven by phonetic factors, further complicates acquiring consistent, high-quality data. Recently, large language models (LLMs) have shown success in addressing data challenges in NLP by generating tailored synthetic data while reducing manual annotation needs. Motivated by this, we explore leveraging LLMs to generate synthetic phrase break annotations, addressing the challenges of both manual annotation and speech-related tasks by comparing with traditional annotations and assessing effectiveness across multiple languages. Our findings suggest that LLM-based synthetic data generation effectively mitigates data challenges in phrase…
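
To make the idea of a synthetic phrase break annotation concrete, here is a minimal Python sketch of how LLM output might be turned into per-token labels. Everything in it is an assumption for illustration: the "/" break marker, the prompt wording, and the helper names build_prompt, parse_breaks, and is_faithful are hypothetical and are not taken from the paper, and no actual LLM call is made.

```python
from typing import List, Tuple

BREAK_MARKER = "/"  # hypothetical marker; the paper's actual format is not given here


def build_prompt(sentence: str) -> str:
    """Build an illustrative annotation prompt (wording is an assumption)."""
    return (
        "Insert ' / ' wherever a speaker would naturally pause. "
        "Do not change, add, or remove any words.\n"
        f"Sentence: {sentence}"
    )


def parse_breaks(annotated: str, marker: str = BREAK_MARKER) -> Tuple[List[str], List[int]]:
    """Convert marker-annotated text into tokens and break labels.

    A label of 1 means a phrase break follows that token; 0 means no break.
    """
    tokens: List[str] = []
    labels: List[int] = []
    for piece in annotated.split():
        if piece == marker:
            if labels:  # ignore a marker that appears before any token
                labels[-1] = 1
        else:
            tokens.append(piece)
            labels.append(0)
    return tokens, labels


def is_faithful(original: str, tokens: List[str]) -> bool:
    """Check that the annotated output kept the original words unchanged."""
    return original.split() == tokens


# Stand-in for an LLM response to build_prompt(sentence).
sentence = "When the sun set we walked home"
llm_output = "When the sun set / we walked home"
tokens, labels = parse_breaks(llm_output)
```

Under these assumptions, the resulting (tokens, labels) pairs could serve as training examples for a token-level break classifier, and is_faithful offers a simple filter for outputs in which the model altered the input text.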

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling