Confidence-based Filtering for Speech Dataset Curation with Generative Speech Enhancement Using Discrete Tokens

Kazuki Yamauchi; Masato Murata; Shogo Seki

arXiv:2601.12254·cs.SD·January 21, 2026

TL;DR

This paper introduces a confidence-based filtering method that uses the log-probabilities of generated tokens to detect hallucination errors in the output of discrete token-based generative speech enhancement (GSE) models, with the aim of improving the quality of curated TTS datasets.

Contribution

The paper presents a non-intrusive filtering approach that uses token confidence scores to identify hallucination errors in GSE models, catching errors that conventional filtering based on non-intrusive speech quality metrics often misses.

Findings

01. Confidence scores correlate with intrusive speech quality metrics.

02. The method detects errors missed by conventional filtering.

03. Filtering improves TTS model performance.

Abstract

Generative speech enhancement (GSE) models show great promise in producing high-quality clean speech from noisy inputs, enabling applications such as curating noisy text-to-speech (TTS) datasets into high-quality ones. However, GSE models are prone to hallucination errors, such as phoneme omissions and speaker inconsistency, which conventional error filtering based on non-intrusive speech quality metrics often fails to detect. To address this issue, we propose a non-intrusive method for filtering hallucination errors from discrete token-based GSE models. Our method leverages the log-probabilities of generated tokens as confidence scores to detect potential errors. Experimental results show that the confidence scores strongly correlate with a suite of intrusive SE metrics, and that our method effectively identifies hallucination errors missed by conventional filtering methods.…
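
The filtering idea in the abstract can be illustrated with a minimal sketch. The abstract says only that token log-probabilities serve as confidence scores; how they are aggregated per utterance and how the cutoff is chosen are not stated here. The sketch below assumes a simple mean over tokens and a fixed threshold, and the names `EnhancedUtterance`, `confidence_score`, and `filter_by_confidence` are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class EnhancedUtterance:
    """One GSE output with the log-probability of each generated token (hypothetical container)."""
    utt_id: str
    token_logprobs: Sequence[float]


def confidence_score(token_logprobs: Sequence[float]) -> float:
    """Utterance-level confidence as the mean token log-probability.

    Assumption: mean aggregation. The source does not specify the aggregation.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return sum(token_logprobs) / len(token_logprobs)


def filter_by_confidence(
    utterances: Sequence[EnhancedUtterance], threshold: float
) -> List[EnhancedUtterance]:
    """Keep utterances whose confidence is at or above the threshold.

    Assumption: a single fixed threshold, chosen by the user.
    """
    return [u for u in utterances if confidence_score(u.token_logprobs) >= threshold]


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the paper.
    batch = [
        EnhancedUtterance("utt_a", [-0.1, -0.2, -0.1]),
        EnhancedUtterance("utt_b", [-2.5, -3.0, -1.5]),
    ]
    kept = filter_by_confidence(batch, threshold=-1.0)
    print([u.utt_id for u in kept])
```

Because the score needs only the model's own output distribution, no clean reference signal is required, which is what makes the filter non-intrusive. In practice the threshold would presumably be tuned on held-out data.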

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis