Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts
Duygu Altinok

TL;DR
This paper introduces a method that improves the accuracy of long speech transcripts by distilling syntactic and semantic knowledge from LLaMA models into Whisper, which in turn benefits downstream tasks such as NER and punctuation.
Contribution
It presents a novel knowledge distillation approach that incorporates syntax and semantics into ASR, improving performance on long audio transcripts.
Findings
Significant reduction in Word Error Rate (WER)
Improved Named Entity Recognition (NER) accuracy
Enhanced punctuation and capitalization success
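For reference, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the ASR hypothesis, divided by the number of reference words. The snippet below is a minimal sketch of that standard definition, not the paper's evaluation script; the function name and the plain whitespace tokenization are assumptions, and the text normalization the authors apply is not stated in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace and compared case-sensitively,
    so capitalization differences count as substitutions.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because the comparison here is case-sensitive, a hypothesis that differs from the reference only in capitalization still incurs errors; whether case and punctuation are stripped before scoring is an evaluation choice that changes the reported number.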
Abstract
ASR systems often struggle to maintain syntactic and semantic accuracy in long audio transcripts, which affects tasks such as Named Entity Recognition (NER), capitalization, and punctuation. We propose a novel approach that enhances ASR by distilling contextual knowledge from LLaMA models into Whisper. Our method uses two strategies: (1) token-level distillation with optimal transport to align dimensions and sequence lengths, and (2) representation loss minimization between sentence embeddings of Whisper and LLaMA, blending syntax and semantics. Evaluations on the Spoken Wikipedia dataset, a benchmark with long audio recordings and rich entities, demonstrate significant improvements in Word Error Rate (WER), NER, capitalization, and punctuation success. By introducing novel NER metrics and exploring semantics-aware ASR, our work highlights the value of integrating linguistic context into…
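The two strategies named in the abstract can be illustrated with a toy NumPy sketch. It is not the authors' implementation. It makes these assumptions: student (Whisper-side) token states are mapped to the teacher (LLaMA-side) dimension with a linear projection, the transport cost is cosine distance, the transport plan comes from entropic Sinkhorn iterations with uniform marginals, and sentence embeddings are mean-pooled token states. All function names, sizes, and the unweighted sum of the two losses are hypothetical.

```python
import numpy as np


def cosine_cost(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance between rows of x (n, d) and rows of y (m, d)."""
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x_unit @ y_unit.T


def sinkhorn_plan(cost: np.ndarray, eps: float = 0.1, n_iters: int = 200) -> np.ndarray:
    """Entropy-regularized transport plan between two uniform distributions.

    The plan has shape (n, m), so sequences of different lengths can be
    softly aligned without a one-to-one token correspondence.
    """
    n, m = cost.shape
    row_mass = np.full(n, 1.0 / n)
    col_mass = np.full(m, 1.0 / m)
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = row_mass / (kernel @ v)
        v = col_mass / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def token_distillation_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Transport cost of moving student token states onto teacher token states."""
    cost = cosine_cost(student, teacher)
    plan = sinkhorn_plan(cost)
    return float((plan * cost).sum())


def representation_loss(student_sentence: np.ndarray, teacher_sentence: np.ndarray) -> float:
    """Cosine distance between two sentence-level embeddings."""
    s = student_sentence / np.linalg.norm(student_sentence)
    t = teacher_sentence / np.linalg.norm(teacher_sentence)
    return float(1.0 - s @ t)


# Toy inputs: 7 student tokens of width 16, 5 teacher tokens of width 32.
rng = np.random.default_rng(0)
student_tokens = rng.normal(size=(7, 16))
teacher_tokens = rng.normal(size=(5, 32))

# Hypothetical projection that resolves the dimension mismatch; in training
# this would be learned, here it is random for illustration only.
projection = rng.normal(size=(16, 32)) / np.sqrt(16)
projected_student = student_tokens @ projection

token_loss = token_distillation_loss(projected_student, teacher_tokens)
sentence_loss = representation_loss(
    projected_student.mean(axis=0), teacher_tokens.mean(axis=0)
)
total_loss = token_loss + sentence_loss  # assumed equal weighting
print(f"token loss: {token_loss:.4f}, sentence loss: {sentence_loss:.4f}")
```

In a real training setup these terms would be computed with a differentiable framework and added to the usual ASR objective; how the paper weights and schedules them is not stated in this summary.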
