Calibration of Phone Likelihoods in Automatic Speech Recognition
David A. van Leeuwen, Joost van Doremalen

TL;DR
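This paper investigates the calibration of phone likelihoods in DNN-based speech recognition. It finds that phone-level scores are best calibrated when the log likelihoods are averaged over the duration of the phone, and that further scaling the average by the logarithm of the duration gives a slight additional improvement.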
Contribution
It reduces Kaldi's DNN shared pdf-id posteriors to phone likelihoods, evaluates them on test-set forced alignments with a calibration-sensitive metric, and improves their calibration through duration-based averaging and scaling.
Findings
Averaging log likelihoods over phone duration enhances calibration.
Further scaling the averaged log likelihood by the logarithm of the duration slightly improves calibration.
The improvement from this scaling is retained on independent test data.
Abstract
In this paper we study the probabilistic properties of the posteriors in a speech recognition system that uses a deep neural network (DNN) for acoustic modeling. We do this by reducing Kaldi's DNN shared pdf-id posteriors to phone likelihoods, and using test set forced alignments to evaluate these using a calibration sensitive metric. Individual frame posteriors are in principle well-calibrated, because the DNN is trained using cross entropy as the objective function, which is a proper scoring rule. When entire phones are assessed, we observe that it is best to average the log likelihoods over the duration of the phone. Further scaling of the average log likelihoods by the logarithm of the duration slightly improves the calibration, and this improvement is retained when tested on independent test data.
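The phone-level scores mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption for illustration: the function names are hypothetical, the input is taken to be a plain list of per-frame log likelihoods for one force-aligned phone, and "scaling by the logarithm of the duration" is read as multiplying the average by log(duration in frames). The paper's actual parameterization may differ.

```python
import math
from typing import Sequence


def sum_loglik(frame_logliks: Sequence[float]) -> float:
    """Total log likelihood of a phone: grows in magnitude with duration."""
    return float(sum(frame_logliks))


def mean_loglik(frame_logliks: Sequence[float]) -> float:
    """Log likelihood averaged over the phone's duration in frames."""
    if len(frame_logliks) == 0:
        raise ValueError("a phone must span at least one frame")
    return sum_loglik(frame_logliks) / len(frame_logliks)


def log_duration_scaled(frame_logliks: Sequence[float]) -> float:
    """Average log likelihood multiplied by log(duration).

    Assumed form, not taken from the paper. Note that under this
    reading a one-frame phone scores 0, because log(1) == 0.
    """
    duration = len(frame_logliks)
    return mean_loglik(frame_logliks) * math.log(duration)


# Hypothetical per-frame log likelihoods for a short and a long phone
# with the same per-frame quality.
short_phone = [-1.0, -2.0]
long_phone = [-1.0, -2.0] * 5

print(sum_loglik(short_phone), sum_loglik(long_phone))
print(mean_loglik(short_phone), mean_loglik(long_phone))
print(log_duration_scaled(short_phone), log_duration_scaled(long_phone))
```

In this toy example the summed score penalizes the longer phone purely for its length, while the averaged score is the same for both phones. That is the motivation for duration normalization. Whether the log-duration factor helps is an empirical question that the paper answers with a calibration-sensitive metric.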
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
