On the Robust Approximation of ASR Metrics
Abdul Waheed, Hanin Atwany, Rita Singh, Bhiksha Raj

TL;DR
This paper introduces a label-free method that uses multimodal embeddings and a proxy model to accurately approximate ASR metrics like WER and CER across diverse datasets, reducing reliance on costly ground truth labels.
Contribution
The paper presents a novel approach combining multimodal embeddings and proxy models to estimate ASR metrics without ground truth labels, improving accuracy and generalization.
Findings
Achieves single-digit absolute difference in metric approximation across datasets
Outperforms the most recent baseline by over 50%
Works effectively across standard and in-the-wild testing conditions
Abstract
Recent advances in speech foundation models are largely driven by scaling both model size and data, enabling them to perform a wide range of tasks, including speech recognition. Traditionally, ASR models are evaluated using metrics like Word Error Rate (WER) and Character Error Rate (CER), which depend on ground truth labels. As a result of limited labeled data from diverse domains and testing conditions, the true generalization capabilities of these models beyond standard benchmarks remain unclear. Moreover, labeling data is both costly and time-consuming. To address this, we propose a novel label-free approach for approximating ASR performance metrics, eliminating the need for ground truth labels. Our method utilizes multimodal embeddings in a unified space for speech and transcription representations, combined with a high-quality proxy model to compute proxy metrics. These features…
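To make the metrics named in the abstract concrete, here is a minimal sketch of how WER and CER are conventionally computed from an edit distance, and of what a proxy metric could look like when a second model's transcript stands in for the missing ground truth. The function names (`edit_distance`, `wer`, `cer`, `proxy_wer`) and the example sentences are illustrative assumptions, not the authors' code, and the paper's actual pipeline (which also uses multimodal embeddings) is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings of characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by the reference length in words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def proxy_wer(proxy_transcript, hypothesis):
    """Assumed form of a proxy metric: WER of the evaluated model's output,
    scored against a proxy model's transcript instead of a human label."""
    return wer(proxy_transcript, hypothesis)


# Hypothetical example data (not from the paper).
ground_truth = "the cat sat on the mat"
hypothesis = "the cat sit on mat"          # output of the ASR model under evaluation
proxy_output = "the cat sat on a mat"      # output of a stronger proxy model

true_wer = wer(ground_truth, hypothesis)
approx_wer = proxy_wer(proxy_output, hypothesis)
```

The first two functions need a human reference; the last one does not, which is the property a label-free approach relies on. How closely the proxy value tracks the true value depends on the quality of the proxy model.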
Taxonomy
Topics: Fault Detection and Control Systems · Ultrasonics and Acoustic Wave Propagation
