One Whisper to Grade Them All
Nhan Phan, Anusha Porwal, Yaroslav Getman, Ekaterina Voskoboinik, Tamás Grósz, Mikko Kurimo

TL;DR
This paper introduces an efficient end-to-end system for holistic automatic speaking assessment that processes multiple responses with a single Whisper encoder, outperforming text-based baselines and demonstrating high data efficiency.
Contribution
The novel architecture processes all test parts simultaneously with a single Whisper encoder and a lightweight aggregator, eliminating transcription and per-part models for scalable language assessment.
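The data flow described above (one shared encoder applied to every response, followed by a small aggregator and a score head) can be illustrated with a toy sketch. This is not the authors' implementation: `toy_encoder` is a hypothetical stand-in rather than Whisper, and mean pooling, averaging across parts, and a linear head are assumptions chosen only to keep the example small. The summary does not specify the aggregator's actual design.

```python
from typing import List, Sequence

Vector = List[float]


def toy_encoder(audio: Sequence[float], dim: int = 4) -> List[Vector]:
    """Hypothetical stand-in for the shared speech encoder (NOT Whisper).

    Emits one `dim`-sized frame per input sample.
    """
    return [[x * (k + 1) for k in range(dim)] for x in audio]


def mean_pool(frames: List[Vector]) -> Vector:
    """Collapse the time axis so a response becomes one fixed-size vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def aggregate(part_vectors: List[Vector]) -> Vector:
    """Assumed lightweight aggregator: average the per-part vectors."""
    n = len(part_vectors)
    return [sum(column) / n for column in zip(*part_vectors)]


def predict_score(responses: List[Sequence[float]],
                  weights: Vector, bias: float) -> float:
    """Encode every part with the SAME encoder, fuse, then apply a linear head."""
    pooled = [mean_pool(toy_encoder(r)) for r in responses]
    fused = aggregate(pooled)
    return sum(w * v for w, v in zip(weights, fused)) + bias


# Four made-up "responses" of different lengths, one per test part.
responses = [[0.1, 0.2], [0.3, 0.4, 0.5], [0.0, 0.2], [0.6]]
score = predict_score(responses, weights=[0.5, 0.25, 0.0, 0.0], bias=3.0)
print(round(score, 4))
```

The point of the sketch is structural: the encoder is called once per part but its parameters are shared, and no transcript or per-part model appears anywhere in the pipeline.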
Findings
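Achieved a Root Mean Squared Error (RMSE) of 0.384, outperforming the text-based baseline (0.44).
With the proposed data sampling strategy, training on only 44.8% of the speakers in the corpus (55.2% fewer) still reached an RMSE of 0.383.
The system uses at most 168M parameters (about 70% of Whisper-small), which makes large-scale deployment practical.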
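For readers unfamiliar with the metric, the following minimal sketch shows how RMSE is computed and what the reported gap amounts to in relative terms. The `predicted` and `reference` scores are made-up illustration values, not data from the paper. Only 0.384 and 0.44 come from the reported results.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root mean squared error between predicted and reference scores."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up holistic scores for four hypothetical speakers.
predicted = [3.5, 4.0, 2.5, 5.0]
reference = [3.0, 4.0, 3.0, 5.5]
error = rmse(predicted, reference)

# Relative RMSE reduction implied by the reported figures (0.44 -> 0.384).
relative_gain = (0.44 - 0.384) / 0.44
print(f"toy RMSE: {error:.3f}, reported relative reduction: {relative_gain:.1%}")
```

Lower is better, so moving from 0.44 to 0.384 corresponds to a reduction of roughly 12.7% relative to the baseline.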
Abstract
We present an efficient end-to-end approach for holistic Automatic Speaking Assessment (ASA) of multi-part second-language tests, developed for the 2025 Speak & Improve Challenge. Our system's main novelty is the ability to process all four spoken responses with a single Whisper-small encoder, combine all information via a lightweight aggregator, and predict the final score. This architecture removes the need for transcription and per-part models, cuts inference time, and makes ASA practical for large-scale Computer-Assisted Language Learning systems. Our system achieved a Root Mean Squared Error (RMSE) of 0.384, outperforming the text-based baseline (0.44) while using at most 168M parameters (about 70% of Whisper-small). Furthermore, we propose a data sampling strategy, allowing the model to train on only 44.8% of the speakers in the corpus and still reach 0.383 RMSE, demonstrating…
