Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation

Rui Hu; Xiaolong Lin; Jiawang Liu; Shixi Huang; Zhenpeng Zhan

arXiv:2506.07646·cs.CL·June 10, 2025

TL;DR

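This paper introduces a method for annotating Japanese speech data with phonemic and prosodic labels. A pre-trained ASR model is fine-tuned while conditioned on the ground-truth transcript, and dictionary-enhanced decoding corrects phonemic errors, improving label accuracy and yielding TTS naturalness comparable to manual annotation.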

Contribution

It presents a new approach combining transcript-conditioned ASR fine-tuning with dictionary-based decoding for improved phonemic and prosodic annotation in Japanese speech datasets.

Findings

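01. In objective evaluation, the method outperforms previous annotation approaches that rely solely on text or solely on audio.

02. A TTS model trained on labels from this method synthesizes speech with naturalness comparable to a model trained on manual annotations.

03. Dictionary-augmented decoding further corrects errors in phonemic labeling.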

Abstract

In this paper, we propose a method for annotating phonemic and prosodic labels on a given audio-transcript pair, aimed at constructing Japanese text-to-speech (TTS) datasets. Our approach involves fine-tuning a large-scale pre-trained automatic speech recognition (ASR) model, conditioned on ground truth transcripts, to simultaneously output phrase-level graphemes and annotation labels. To further correct errors in phonemic labeling, we employ a decoding strategy that utilizes dictionary prior knowledge. The objective evaluation results demonstrate that our proposed method outperforms previous approaches relying solely on text or audio. The subjective evaluation results indicate that the naturalness of speech synthesized by the TTS model, trained with labels annotated using our method, is comparable to that of a model trained with manual annotations.
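
The abstract does not spell out how the dictionary prior enters decoding, so the following is only a minimal sketch of one way such a constraint could work, not the authors' procedure. It assumes the fine-tuned model yields several scored label hypotheses per phrase, that prosodic markers are the characters `[`, `]`, and `#`, and that a dictionary maps each grapheme phrase to its allowed readings. The names `TOY_DICTIONARY`, `strip_prosody`, and `rerank_with_dictionary` are hypothetical and introduced purely for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy dictionary: grapheme phrase -> allowed katakana readings.
TOY_DICTIONARY: Dict[str, Set[str]] = {
    "今日": {"キョウ", "コンニチ"},
    "明日": {"アシタ", "アス", "ミョウニチ"},
}


def strip_prosody(label: str, markers: str = "[]#") -> str:
    """Drop the assumed prosodic marker characters, leaving the reading."""
    return "".join(ch for ch in label if ch not in markers)


def rerank_with_dictionary(
    grapheme: str,
    hypotheses: List[Tuple[str, float]],
    dictionary: Dict[str, Set[str]],
) -> str:
    """Return the best-scoring label whose reading the dictionary allows.

    `hypotheses` holds (label, log_score) pairs, higher score being better.
    If the phrase is not in the dictionary, or no hypothesis matches an
    allowed reading, the top-scoring hypothesis is returned unchanged.
    """
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    allowed = dictionary.get(grapheme)
    if allowed is None:
        return ranked[0][0]
    for label, _score in ranked:
        if strip_prosody(label) in allowed:
            return label
    return ranked[0][0]


if __name__ == "__main__":
    # Assumed model output: the top hypothesis has a reading the dictionary rejects.
    candidates = [("キョ[ー", -0.2), ("キョ[ウ", -0.5), ("コンニチ", -1.3)]
    print(rerank_with_dictionary("今日", candidates, TOY_DICTIONARY))
```

In this sketch the dictionary acts as a hard filter over an n-best list. A real system might instead apply the constraint token by token during beam search, or use the dictionary as a soft score bonus; the abstract leaves that choice unspecified.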

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research