Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning
Hien Ohnaka, Yuma Shirahata, Byeongseon Park, Ryuichi Yamamoto

TL;DR
This paper introduces a novel speech annotation model that ensures phonemic and prosodic labels are consistent with graphemes by leveraging implicit and explicit conditioning methods, enhancing downstream task performance.
Contribution
The proposed model uniquely combines implicit prompt-based and explicit pruning techniques to improve grapheme-coherent speech label generation, surpassing previous fine-tuning approaches.
Findings
Significantly improved grapheme-label consistency.
Enhanced accent estimation accuracy using generated parallel data.
Applicable to various downstream speech tasks.
Abstract
We propose a model to obtain phonemic and prosodic labels of speech that are coherent with graphemes. Unlike previous methods that simply fine-tune a pre-trained ASR model with the labels, the proposed model conditions the label generation on corresponding graphemes by two methods: 1) Add implicit grapheme conditioning through prompt encoder using pre-trained BERT features. 2) Explicitly prune the label hypotheses inconsistent with the grapheme during inference. These methods enable obtaining parallel data of speech, the labels, and graphemes, which is applicable to various downstream tasks such as text-to-speech and accent estimation from text. Experiments showed that the proposed method significantly improved the consistency between graphemes and the predicted labels. Further, experiments on accent estimation task confirmed that the created parallel data by the proposed method…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Emotion and Mood Recognition · Phonetics and Phonology Research
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Linear Layer · Attention Dropout · Softmax · WordPiece · Weight Decay · Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · Dropout
