Mind the Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions
Duygu Altinok

TL;DR
This paper introduces a context-aware training method for ASR systems that enhances entity recognition and formatting by using overlapping context windows and entity labeling, significantly improving performance on long-form transcriptions.
Contribution
The paper proposes a novel training approach with overlapping context windows and entity labels to improve entity recognition and formatting in ASR systems.
Findings
Improved named entity recognition accuracy.
Enhanced entity formatting in transcriptions.
Better semantic understanding in long-form speech recognition.
Abstract
Automatic Speech Recognition (ASR) systems, such as Whisper, achieve high transcription accuracy but struggle with named entities and numerical data, especially when proper formatting is required. These issues increase word error rate (WER) and impair semantic understanding in critical domains like legal, financial, and medical applications. We propose a novel training approach that extends the semantic context of ASR models by adding overlapping context windows during training. By sliding 5-second overlaps on both sides of 30-second chunks, we create a 40-second "effective semantic window," improving entity recognition and formatting while focusing predictions on the central 30 seconds. To address entities spanning chunk boundaries, we reassign such entities entirely to the right-hand chunk, ensuring proper formatting. Additionally, enriched training data with embedded entity labels…
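The windowing and boundary rule described in the abstract can be sketched as follows. This is a minimal illustration, not the author's implementation: the names `context_windows`, `assign_entities`, and `Entity` are hypothetical, entity timestamps are assumed to be available in seconds, and clamping the context at the start and end of the recording is an assumption the abstract does not spell out.

```python
from dataclasses import dataclass

CHUNK_S = 30.0    # central chunk the model is trained to predict
OVERLAP_S = 5.0   # extra context added on each side of the chunk


@dataclass
class Entity:
    # Hypothetical container: entity span in seconds plus its formatted text.
    start: float
    end: float
    text: str


def context_windows(duration, chunk=CHUNK_S, overlap=OVERLAP_S):
    """Return (ctx_start, core_start, core_end, ctx_end) tuples in seconds.

    The core is the 30-second span to predict; the context extends it by
    `overlap` on both sides, clamped to the recording (an assumption).
    """
    windows = []
    core_start = 0.0
    while core_start < duration:
        core_end = min(core_start + chunk, duration)
        ctx_start = max(core_start - overlap, 0.0)
        ctx_end = min(core_end + overlap, duration)
        windows.append((ctx_start, core_start, core_end, ctx_end))
        core_start += chunk
    return windows


def assign_entities(entities, windows):
    """Assign each entity to exactly one core chunk.

    An entity that crosses a chunk boundary goes entirely to the
    right-hand chunk, i.e. the chunk whose core contains its end time.
    """
    assigned = [[] for _ in windows]
    for ent in entities:
        for i, (_, _, core_end, _) in enumerate(windows):
            is_last = i == len(windows) - 1
            if ent.end <= core_end or is_last:
                assigned[i].append(ent)
                break
    return assigned


if __name__ == "__main__":
    wins = context_windows(70.0)
    ents = [Entity(10.0, 12.0, "Whisper"), Entity(29.0, 31.5, "$1,250.00")]
    for win, chunk_ents in zip(wins, assign_entities(ents, wins)):
        print(win, [e.text for e in chunk_ents])
```

In this sketch an interior chunk receives 5 + 30 + 5 = 40 seconds of audio, matching the "effective semantic window" in the abstract, while the first and last chunks receive less because no audio exists beyond the recording's edges.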
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
