Analyzing Learned Representations of a Deep ASR Performance Prediction Model
Zied Elloumi, Laurent Besacier, Olivier Galibert, Benjamin Lecouteux

TL;DR
This paper analyzes the internal representations of a CNN-based ASR performance prediction model, revealing that it captures speech style, accent, and broadcast type, and demonstrates benefits of multi-task learning for improved prediction and tagging.
Contribution
It provides an in-depth analysis of speech and text embeddings in a CNN model for ASR performance prediction and introduces multi-task learning to enhance prediction and classification.
Findings
Hidden layers encode speech style, accent, and broadcast type.
Multi-task learning on these three attributes yields slightly more efficient ASR performance prediction systems.
The model can simultaneously predict ASR performance and classify utterance attributes.
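To make the multi-task idea concrete, here is a minimal sketch of a combined training objective: a word error rate regression term plus auxiliary classification terms for attributes such as speech style, accent, and broadcast type. This summary does not state the paper's actual loss, so the absolute-error term, the cross-entropy terms, the `aux_weight` parameter, and all function and task names below are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy for one example, given raw class scores and the true class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def multitask_loss(wer_pred, wer_true, task_logits, task_labels, aux_weight=0.5):
    """Absolute WER error plus a weighted sum of auxiliary classification losses.

    task_logits and task_labels are dicts keyed by a (hypothetical) task name,
    e.g. "style", "accent", "broadcast_type". aux_weight is an assumed
    hyperparameter that balances the main task against the auxiliary ones.
    """
    regression = abs(wer_pred - wer_true)
    auxiliary = sum(
        softmax_cross_entropy(task_logits[name], task_labels[name])
        for name in task_labels
    )
    return regression + aux_weight * auxiliary


# Example: one utterance with a predicted WER of 0.30 against a reference of 0.25.
logits = {"style": [2.0, 0.0], "accent": [0.0, 0.0], "broadcast_type": [1.0, 1.0, 1.0]}
labels = {"style": 0, "accent": 1, "broadcast_type": 2}
loss = multitask_loss(0.30, 0.25, logits, labels)
```

Setting `aux_weight` to 0 reduces this to a single-task WER predictor, which is one simple way to compare the two training setups under these assumptions.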
Abstract
This paper addresses a relatively new task: prediction of ASR performance on unseen broadcast programs. In a previous paper, we presented an ASR performance prediction system using CNNs that encode both text (ASR transcript) and speech, in order to predict word error rate. This work is dedicated to the analysis of speech signal embeddings and text embeddings learnt by the CNN while training our prediction model. We try to better understand which information is captured by the deep model and its relation with different conditioning factors. It is shown that hidden layers convey a clear signal about speech style, accent and broadcast type. We then try to leverage these three types of information at training time through multi-task learning. Our experiments show that this allows us to train slightly more efficient ASR performance prediction systems that, in addition, simultaneously tag the…
