Smooth Operators: LLMs Translating Imperfect Hints into Disfluency-Rich Transcripts
Duygu Altinok

TL;DR
This paper introduces a method that uses large language models to transcribe speech disfluencies as explicit, timestamped tokens. It combines acoustic representations from an audio encoder with imperfect textual hints to produce fully annotated, disfluency-rich transcripts.
Contribution
It presents a new approach in which an LLM generates disfluency-rich transcripts from acoustic representations paired with imperfect textual inputs, some of them time-aligned, making disfluency detection more robust to input quality.
Findings
LLMs can effectively handle imperfect textual hints with timestamp cues.
The method produces fully annotated disfluency transcripts.
Robustness to imperfections in the textual input makes the method more usable in downstream speech processing applications.
Abstract
Accurate detection of disfluencies in spoken language is crucial for enhancing the performance of automatic speech and language processing systems, as well as fostering the development of more inclusive speech and language technologies. Leveraging the growing trend of large language models (LLMs) as versatile learners capable of processing both lexical and non-lexical inputs (e.g., audio and video), we propose a novel approach to transcribing disfluencies as explicit tokens with timestamps, enabling the generation of fully annotated disfluency-rich transcripts. Our method integrates acoustic representations extracted from an audio encoder with textual inputs of varying quality: clean transcriptions without disfluencies, time-aligned transcriptions from aligners, or outputs from phoneme-based ASR models -- all of which may contain imperfections. Importantly, our experiments demonstrate…
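To make the idea of "disfluencies as explicit tokens with timestamps" concrete, here is a minimal sketch of what such a transcript could look like as a data structure. It is illustrative only: the `Token` class, the `[FP]` (filled pause) tag, the `word<start-end>` rendering, and the example timings are all hypothetical and are not taken from the paper, whose actual token inventory and output format are not given in this summary. The sketch also does not model the LLM or the audio encoder; it only shows how a time-aligned clean transcript (the textual hint) and separately detected disfluency events might be interleaved into one annotated sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One transcript unit: a word or a disfluency tag (hypothetical format)."""
    text: str     # a word, or a tag such as "[FP]" for a filled pause (assumed tag name)
    start: float  # start time in seconds
    end: float    # end time in seconds


def merge_by_time(words, disfluencies):
    """Interleave aligned words and disfluency tokens in order of start time."""
    return sorted([*words, *disfluencies], key=lambda t: (t.start, t.end))


def render(tokens):
    """Render tokens as 'text<start-end>' (an assumed, illustrative format)."""
    return " ".join(f"{t.text}<{t.start:.2f}-{t.end:.2f}>" for t in tokens)


# Clean, time-aligned transcript without disfluencies (the textual hint).
words = [
    Token("i", 0.00, 0.10),
    Token("want", 0.55, 0.80),
    Token("coffee", 0.85, 1.30),
]
# A disfluency event that the clean transcript omits.
disfluencies = [Token("[FP]", 0.15, 0.50)]

rich = merge_by_time(words, disfluencies)
print(render(rich))
```

In this toy example the filled pause falls in the gap between "i" and "want", so sorting by start time is enough to place it. Real hints may carry timing errors or overlaps, which is the kind of imperfection the abstract says the method is designed to tolerate.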
