Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts

Duygu Altinok

arXiv:2508.13376·cs.CL·August 20, 2025

TL;DR

This paper introduces a method that improves the accuracy of long speech transcripts by distilling syntactic and semantic knowledge from LLaMA models into Whisper, which in turn benefits downstream tasks such as NER, capitalization, and punctuation.

Contribution

The paper presents a knowledge distillation approach that transfers syntactic and semantic context from LLaMA into the Whisper ASR model, improving performance on long audio transcripts.

Findings

01

Significant reduction in Word Error Rate (WER)

02

Improved Named Entity Recognition (NER) accuracy

03

Enhanced punctuation and capitalization success
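
For readers unfamiliar with the first metric, the following is a minimal sketch of the standard Word Error Rate definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this summary does not say how the paper normalizes text or computes WER in practice.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words.

    Tokenization by whitespace is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Here `example_wer` evaluates to 2/6, since two edits are needed against a six-word reference.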

Abstract

ASR systems often struggle with maintaining syntactic and semantic accuracy in long audio transcripts, impacting tasks like Named Entity Recognition (NER), capitalization, and punctuation. We propose a novel approach that enhances ASR by distilling contextual knowledge from LLaMA models into Whisper. Our method uses two strategies: (1) token-level distillation with optimal transport to align dimensions and sequence lengths, and (2) representation loss minimization between sentence embeddings of Whisper and LLaMA, blending syntax and semantics. Evaluations on the Spoken Wikipedia dataset, a benchmark with long audio recordings and rich entities, demonstrate significant improvements in Word Error Rate (WER), NER, capitalization, and punctuation success. By introducing novel NER metrics and exploring semantics-aware ASR, our work highlights the value of integrating linguistic context into…
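
The two strategies named in the abstract can be illustrated with a small NumPy sketch. It is not the paper's implementation: the linear projection `proj`, the cosine cost, the uniform marginals, the entropic (Sinkhorn) solver, the mean pooling, and all sizes and hyperparameters are assumptions chosen only to show how an optimal transport plan can align two token sequences of different length and width, and how a sentence-level representation loss can sit alongside it.

```python
import numpy as np


def cosine_cost(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance between rows of a (n, d) and b (m, d)."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_n @ b_n.T


def sinkhorn_plan(cost: np.ndarray, eps: float = 0.1, n_iters: int = 200) -> np.ndarray:
    """Entropic-regularized transport plan with uniform marginals (assumed)."""
    n, m = cost.shape
    r = np.full(n, 1.0 / n)  # mass on each student token
    c = np.full(m, 1.0 / m)  # mass on each teacher token
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    for _ in range(n_iters):
        v = c / (kernel.T @ u)
        u = r / (kernel @ v)
    return u[:, None] * kernel * v[None, :]


def token_ot_loss(student: np.ndarray, teacher: np.ndarray, proj: np.ndarray):
    """Token-level loss: transport cost between projected student and teacher tokens."""
    projected = student @ proj            # (n, d_student) -> (n, d_teacher)
    cost = cosine_cost(projected, teacher)
    plan = sinkhorn_plan(cost)            # soft alignment across sequence lengths
    return float(np.sum(plan * cost)), plan


def sentence_repr_loss(student: np.ndarray, teacher: np.ndarray, proj: np.ndarray) -> float:
    """Sentence-level loss: cosine distance between mean-pooled embeddings (assumed pooling)."""
    s = (student @ proj).mean(axis=0)
    t = teacher.mean(axis=0)
    cos = s @ t / (np.linalg.norm(s) * np.linalg.norm(t))
    return float(1.0 - cos)


# Hypothetical toy shapes: 7 student tokens of width 16, 5 teacher tokens of width 32.
rng = np.random.default_rng(0)
student_tokens = rng.normal(size=(7, 16))
teacher_tokens = rng.normal(size=(5, 32))
proj = rng.normal(size=(16, 32)) / np.sqrt(16)  # stands in for a learned projection

ot_loss, plan = token_ot_loss(student_tokens, teacher_tokens, proj)
rep_loss = sentence_repr_loss(student_tokens, teacher_tokens, proj)
```

In this sketch the transport plan handles the sequence-length mismatch (7 versus 5 tokens) while the projection handles the dimension mismatch; in a training setup the two losses would presumably be weighted and added to the usual ASR objective, but the weighting is not given in this summary.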
