Acoustically Precise Hesitation Tagging Is Essential for End-to-End Verbatim Transcription Systems

Jhen-Ke Lin; Hao-Chien Lu; Chung-Chun Wang; Hong-Yun Lin; Berlin Chen

arXiv:2506.04076 · cs.CL · July 28, 2025

TL;DR

This paper shows that tagging hesitations with acoustically precise fillers, rather than removing them, improves the accuracy of end-to-end verbatim transcription of second-language speech when Whisper models are fine-tuned on the resulting transcripts.

Contribution

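The paper introduces 'Extra', an annotation scheme in which hesitations are transcribed as acoustically precise fillers inferred by Gemini 2.0 Flash from existing audio-transcript pairs, and shows that this explicit labeling improves ASR performance without external audio training data.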

Findings

01. Fine-tuning Whisper Large V3 Turbo with the 'Extra' scheme lowers WER from 6.2% ('Pure') to 5.5%, an 11.3% relative improvement.

02. Explicit hesitation tagging improves transcription accuracy for L2 speech.

03. The challenge system achieved 5.81% WER with the 'Extra' scheme, against 6.47% with 'Pure'.
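
The relative figure in the findings follows directly from the two absolute WERs. A minimal sketch of that arithmetic is given here; the function name `relative_wer_reduction` is illustrative and does not come from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer over baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Post-challenge result reported in the abstract: "Pure" 6.2% vs "Extra" 5.5%.
post_challenge = round(relative_wer_reduction(6.2, 5.5), 1)
print(post_challenge)  # 11.3

# Challenge system reported in the abstract: "Pure" 6.47% vs "Extra" 5.81%.
challenge = round(relative_wer_reduction(6.47, 5.81), 1)
print(challenge)  # 10.2
```

The 10.2% value for the challenge system is not stated in the source; it is derived here from the two reported WERs using the same formula.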

Abstract

Verbatim transcription for automatic speaking assessment demands accurate capture of disfluencies, crucial for downstream tasks like error analysis and feedback. However, many ASR systems discard or generalize hesitations, losing important acoustic details. We fine-tune Whisper models on the Speak & Improve 2025 corpus using low-rank adaptation (LoRA), without recourse to external audio training data. We compare three annotation schemes: removing hesitations (Pure), generic tags (Rich), and acoustically precise fillers inferred by Gemini 2.0 Flash from existing audio-transcript pairs (Extra). Our challenge system achieved 6.47% WER (Pure) and 5.81% WER (Extra). Post-challenge experiments reveal that fine-tuning Whisper Large V3 Turbo with the "Extra" scheme yielded a 5.5% WER, an 11.3% relative improvement over the "Pure" scheme (6.2% WER). This demonstrates that explicit, realistic…
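
To make the difference between the three annotation schemes concrete, the sketch below renders one utterance under each of them. It rests on several assumptions that are not in the source: the `Token` structure, the `<hes>` placeholder used as the generic tag, and the example utterance are all hypothetical. In the paper the precise fillers are inferred by Gemini 2.0 Flash from audio-transcript pairs; here they are simply assumed to be already available on each token.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical generic tag; the source does not specify the actual symbol.
GENERIC_TAG = "<hes>"
SCHEMES = ("pure", "rich", "extra")


@dataclass
class Token:
    text: str                    # surface form, e.g. "went" or "um"
    is_hesitation: bool = False  # True for filled pauses


def render(tokens: List[Token], scheme: str) -> str:
    """Render a transcript under one annotation scheme.

    pure  -> hesitations removed
    rich  -> hesitations replaced by a single generic tag
    extra -> hesitations kept as their specific filler
    """
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme: {scheme!r}")
    words = []
    for tok in tokens:
        if not tok.is_hesitation:
            words.append(tok.text)
        elif scheme == "rich":
            words.append(GENERIC_TAG)
        elif scheme == "extra":
            words.append(tok.text)
        # scheme == "pure": the hesitation is dropped
    return " ".join(words)


# Illustrative utterance (not taken from the corpus).
utterance = [
    Token("i"),
    Token("um", is_hesitation=True),
    Token("went"),
    Token("uh", is_hesitation=True),
    Token("home"),
]

for scheme in SCHEMES:
    print(f"{scheme:5s} {render(utterance, scheme)}")
```

Under these assumptions, only the "extra" rendering preserves which filler was produced, which is the acoustic detail the abstract says is lost when hesitations are discarded or generalized.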

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Voice and Speech Disorders