TL;DR
This paper introduces a multilingual speech correction method that combines disfluency detection, instruction fine-tuning of LLMs, and contrastive learning to improve transcript fluency across Indian languages.
Contribution
The paper presents a multilingual correction pipeline that integrates token-level disfluency signals with instruction tuning and contrastive learning, and reports that it outperforms existing models.
Findings
Consistent improvements over strong baselines in Hindi, Bengali, and Marathi.
Detection-only strategies are insufficient for effective disfluency correction.
Combining token cues with instruction tuning and contrastive learning enhances speech transcript quality.
Abstract
Automatic Speech Recognition (ASR) transcripts often contain disfluencies, such as fillers, repetitions, and false starts, which reduce readability and hinder downstream applications like chatbots and voice assistants. If left unaddressed, such disfluencies can significantly degrade the reliability of downstream systems. Most existing approaches rely on classical models that focus on identifying disfluent tokens for removal. While this strategy is effective to some extent, it often disrupts grammatical structure and semantic coherence, leading to incomplete or unnatural sentences. Recent work has explored the use of large language models (LLMs); however, these efforts have primarily focused on disfluency detection or data augmentation rather than on comprehensive correction. We propose a multilingual correction pipeline where a sequence tagger first marks disfluent tokens,…
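To make the contrast between deletion and tagger-guided correction concrete, here is a minimal sketch. It is not the authors' implementation: the tag set (`D`/`F`), the `<dis>` markers, the prompt wording, and all function names are assumptions chosen for illustration, and the example sentence is in English only for readability.

```python
from typing import List, Tuple

# Hypothetical tag set (not from the paper): "D" = disfluent, "F" = fluent.
Tagged = List[Tuple[str, str]]


def delete_disfluent(tagged: Tagged) -> str:
    """Detection-only baseline: drop every token tagged as disfluent."""
    return " ".join(tok for tok, tag in tagged if tag != "D")


def mark_disfluent(tagged: Tagged, open_tag: str = "<dis>",
                   close_tag: str = "</dis>") -> str:
    """Wrap each maximal run of disfluent tokens in marker tags."""
    out: List[str] = []
    in_run = False
    for tok, tag in tagged:
        if tag == "D" and not in_run:
            out.append(open_tag)   # a disfluent run starts here
            in_run = True
        elif tag != "D" and in_run:
            out.append(close_tag)  # the run ended before this token
            in_run = False
        out.append(tok)
    if in_run:
        out.append(close_tag)      # close a run that reaches the end
    return " ".join(out)


def build_prompt(tagged: Tagged, language: str) -> str:
    """Build an instruction prompt that passes the tagger's cues to an LLM.

    The wording is an assumed template, not the paper's prompt.
    """
    marked = mark_disfluent(tagged)
    return (
        f"Rewrite the following {language} transcript as a fluent sentence. "
        "Tokens inside <dis>...</dis> are likely disfluent.\n"
        f"Transcript: {marked}\n"
        "Fluent:"
    )


if __name__ == "__main__":
    example: Tagged = [
        ("I", "F"), ("want", "F"), ("a", "F"), ("flight", "F"),
        ("to", "D"), ("uh", "D"), ("to", "D"), ("Boston", "D"),
        ("I", "D"), ("mean", "D"), ("to", "F"), ("Denver", "F"),
    ]
    print(delete_disfluent(example))
    print(build_prompt(example, "English"))
```

In this sketch the deletion baseline can only remove tokens, whereas the prompt leaves the marked tokens in place so that a fine-tuned model is free to rewrite around them.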
