Hystoc: Obtaining word confidences for fusion of end-to-end ASR systems
Karel Beneš, Martin Kocour, Lukáš Burget

TL;DR
Hystoc is a novel method that derives well-calibrated word confidences from end-to-end speech recognition hypotheses, improving fusion performance and accuracy estimation.
Contribution
Hystoc introduces an iterative alignment approach to extract word confidences from hypothesis scores, enhancing system fusion and confidence calibration.
Findings
Hystoc produces confidences correlating with hypothesis accuracy.
Fusion with Hystoc increases fusion gains by up to 1% absolute WER on the Spanish RTVE2020 dataset.
Limited gains when fusing very similar systems using Hystoc.
Abstract
End-to-end (e2e) systems have recently gained wide popularity in automatic speech recognition. However, these systems generally do not provide well-calibrated word-level confidences. In this paper, we propose Hystoc, a simple method for obtaining word-level confidences from hypothesis-level scores. Hystoc is an iterative alignment procedure that turns hypotheses from an n-best output of the ASR system into a confusion network. Word-level confidences are then obtained as posterior probabilities in the individual bins of the confusion network. We show that Hystoc provides confidences that correlate well with the accuracy of the ASR hypothesis. Furthermore, we show that utilizing Hystoc in fusion of multiple e2e ASR systems increases the gains from the fusion by up to 1% WER absolute on the Spanish RTVE2020 dataset. Finally, we experiment with using Hystoc for direct fusion of…
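The idea described in the abstract (align n-best hypotheses one by one into a confusion network, then read word confidences as per-bin posteriors) can be illustrated with a small sketch. This is not the authors' implementation. It assumes the hypothesis scores are log-domain and normalizes them with a softmax, uses a plain unit-cost Levenshtein alignment, and introduces hypothetical names (`EPS`, `hypothesis_posteriors`, `align`, `build_confusion_network`) purely for illustration; the abstract does not specify any of these details.

```python
import math

EPS = "<eps>"  # hypothetical symbol for "no word in this bin"


def hypothesis_posteriors(log_scores):
    """Softmax over hypothesis-level log scores (an assumed normalization)."""
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]


def align(bins, words):
    """Unit-cost Levenshtein alignment of a word sequence to existing bins.

    Returns a list of (op, bin_index, word) with op in
    {"match", "skip", "new"}: the word goes into a bin, the bin receives
    an empty word, or the word opens a new bin.
    """
    n, m = len(bins), len(words)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0 and j > 0:  # put the word into bin i-1
                sub = 0 if words[j - 1] in bins[i - 1] else 1
                cands.append((cost[i - 1][j - 1] + sub, "match"))
            if i > 0:  # bin i-1 gets an empty word
                gap = 0 if EPS in bins[i - 1] else 1
                cands.append((cost[i - 1][j] + gap, "skip"))
            if j > 0:  # the word opens a new bin
                cands.append((cost[i][j - 1] + 1, "new"))
            cost[i][j], back[i][j] = min(cands, key=lambda c: c[0])
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # trace the best path back to the origin
        op = back[i][j]
        if op == "match":
            ops.append((op, i - 1, words[j - 1]))
            i, j = i - 1, j - 1
        elif op == "skip":
            ops.append((op, i - 1, None))
            i -= 1
        else:
            ops.append((op, None, words[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def build_confusion_network(nbest, log_scores):
    """Merge hypotheses one at a time; each bin maps word -> posterior mass."""
    bins, seen = [], 0.0
    for words, p in zip(nbest, hypothesis_posteriors(log_scores)):
        merged = []
        for op, i, word in align(bins, words):
            if op == "new":
                # Earlier hypotheses had no word here, so they vote for EPS.
                merged.append({word: p, EPS: seen} if seen > 0 else {word: p})
            else:
                key = word if op == "match" else EPS
                bins[i][key] = bins[i].get(key, 0.0) + p
                merged.append(bins[i])
        bins, seen = merged, seen + p
    return bins


nbest = [["the", "cat", "sat"], ["the", "cat", "sat", "down"], ["a", "cat", "sat"]]
network = build_confusion_network(nbest, [-1.0, -2.0, -3.0])
for position, word_posteriors in enumerate(network):
    print(position, word_posteriors)
```

In this toy run every hypothesis contributes its posterior mass to exactly one entry in every bin, so each bin sums to one and its entries can be read directly as word confidences. A word shared by all hypotheses ("cat") ends up with confidence close to 1, while a word present only in a lower-scoring hypothesis ("down") competes with the empty word in its bin.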
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
