Statistical Testing on ASR Performance via Blockwise Bootstrap
Zhe Liu, Fuchun Peng

TL;DR
This paper introduces a blockwise bootstrap method for statistically evaluating ASR performance differences, accounting for data dependencies like speaker correlation, and demonstrates its effectiveness on synthetic and real speech data.
Contribution
It proposes a novel blockwise bootstrap approach for more reliable significance testing in ASR evaluations with dependent data.
Findings
Blockwise bootstrap provides consistent variance estimates.
The method is validated on synthetic data.
The method is effective on real-world speech data.
Abstract
A common question raised in automatic speech recognition (ASR) evaluations is how reliable an observed word error rate (WER) improvement is when comparing two ASR systems. Statistical hypothesis testing and confidence intervals (CIs) can be used to tell whether the improvement is real or only due to random chance. The bootstrap resampling method has been popular for such significance analysis because it is intuitive and easy to use. However, this method fails on dependent data, which is prevalent in the speech world: for example, ASR performance on utterances from the same speaker could be correlated. In this paper we present a blockwise bootstrap approach: by dividing the evaluation utterances into nonoverlapping blocks, this method resamples these blocks instead of the original data. We show that the resulting variance estimator of absolute WER difference between two ASR systems…
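The resampling idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tuple layout `(speaker_id, n_ref_words, errors_a, errors_b)`, the function names `wer_diff` and `blockwise_bootstrap_ci`, the choice of one block per speaker, and the percentile confidence interval are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def wer_diff(utts):
    """Absolute WER difference (system B minus system A).

    Each utterance is a hypothetical tuple:
    (speaker_id, n_ref_words, errors_a, errors_b).
    """
    total_words = sum(u[1] for u in utts)
    errors_a = sum(u[2] for u in utts)
    errors_b = sum(u[3] for u in utts)
    return (errors_b - errors_a) / total_words


def blockwise_bootstrap_ci(utts, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for the WER difference, resampling whole blocks.

    Assumption: blocks are formed by speaker, so correlated utterances
    from one speaker always stay together in a resample.
    """
    rng = random.Random(seed)

    # Group utterances into nonoverlapping blocks (one per speaker).
    blocks = defaultdict(list)
    for u in utts:
        blocks[u[0]].append(u)
    block_list = list(blocks.values())

    stats = []
    for _ in range(n_boot):
        # Draw as many blocks as exist, with replacement.
        sample = []
        for _ in range(len(block_list)):
            sample.extend(rng.choice(block_list))
        stats.append(wer_diff(sample))

    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return wer_diff(utts), lo, hi


if __name__ == "__main__":
    # Toy data, invented purely to exercise the functions.
    toy = [
        ("s1", 10, 2, 1),
        ("s1", 10, 3, 1),
        ("s2", 10, 1, 1),
        ("s3", 20, 4, 2),
    ]
    point, lo, hi = blockwise_bootstrap_ci(toy, n_boot=200)
    print(f"WER diff = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Under these assumptions, an interval that excludes zero would suggest that the observed WER difference is unlikely to be due to chance alone. Resampling individual utterances instead of blocks would recover the ordinary bootstrap, which the abstract argues is unreliable when utterances are dependent.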
Taxonomy
Topics: Speech Recognition and Synthesis · Algorithms and Data Compression · Advanced Data Compression Techniques
