A Deep Dive into the Disparity of Word Error Rates Across Thousands of NPTEL MOOC Videos
Anand Kumar Rai, Siddharth D Jaiswal, Animesh Mukherjee

TL;DR
This study analyzes the performance disparities of state-of-the-art ASR systems across diverse Indian demographics using a large dataset of NPTEL MOOC videos, highlighting the need for more inclusive speech recognition models.
Contribution
The paper introduces a large, diverse speech dataset from NPTEL MOOCs and evaluates ASR disparities across demographic and disciplinary traits, revealing significant biases.
Findings
Disparities exist based on gender, native region, age, and speech rate.
No disparity was found based on caste.
Significant disparity was observed across lecture disciplines.
Abstract
Automatic speech recognition (ASR) systems are designed to transcribe spoken language into written text and find utility in a variety of applications including voice assistants and transcription services. However, it has been observed that state-of-the-art ASR systems, which deliver impressive benchmark results, struggle with speakers of certain regions or demographics due to variation in their speech properties. In this work, we describe the curation of a massive speech dataset of 8740 hours consisting of thousands of technical lectures in the English language along with their transcripts delivered by instructors representing various parts of Indian demography. The dataset is sourced from the very popular NPTEL MOOC platform. We use the curated dataset to measure the existing disparity in YouTube Automatic Captions and OpenAI Whisper model performance across the diverse demographic traits…
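For readers unfamiliar with the metric in the title, word error rate (WER) is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition and of averaging WER per demographic group; it is not the authors' evaluation pipeline. The function names, the lowercase-and-split normalization, and the `(group, reference, hypothesis)` sample layout are assumptions made for illustration.

```python
from collections import defaultdict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def mean_wer_by_group(samples):
    """Average per-sample WER for each group label.

    `samples` is an iterable of (group, reference, hypothesis) tuples;
    this layout is hypothetical, not the dataset's actual schema.
    """
    per_group = defaultdict(list)
    for group, reference, hypothesis in samples:
        per_group[group].append(word_error_rate(reference, hypothesis))
    return {group: sum(v) / len(v) for group, v in per_group.items()}
```

Comparing the per-group averages returned by `mean_wer_by_group` is one simple way to surface the kind of disparity the paper reports; a real analysis would also need text normalization suited to the transcripts and a statistical test on the group differences.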
Taxonomy
Topics: Speech Recognition and Synthesis · Multimodal Machine Learning Applications · Speech and dialogue systems
