Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition

Kush Juvekar; Kavya Manohar; Aditya Srinivas Menon; Arghya Bhattacharya; Kumarmanas Nethil

arXiv:2605.13087·cs.CL·May 14, 2026

TL;DR

Vividh-ASR is a complexity-stratified speech recognition benchmark for Hindi and Malayalam. The paper also proposes reverse multi-stage fine-tuning (R-MFT), a training recipe that lets a 244M-parameter Whisper model match or exceed conventionally fine-tuned 769M-parameter counterparts.

Contribution

The paper contributes the Vividh-ASR benchmark and the R-MFT training recipe for low-resource speech recognition, and it uses CKA and SVD to analyze how model representations adapt during fine-tuning.

Findings

01 Early large parameter updates improve global WER by 12 absolute points.

02 A hard-to-easy curriculum adds further gains on spontaneous speech.

03 R-MFT lets a 244M-parameter Whisper model match or exceed conventionally fine-tuned 769M-parameter counterparts.
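The WER gain in the first finding is an absolute difference between two word error rates, not a relative reduction. The sketch below shows one common way to compute WER (word-level edit distance divided by reference length) and how an absolute-point gain is derived from two WER values. The function names and the example numbers are hypothetical and are not taken from the paper, whose exact text normalization and scoring setup are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def absolute_point_gain(wer_before: float, wer_after: float) -> float:
    """Improvement in absolute WER points (WER expressed as a percentage)."""
    return (wer_before - wer_after) * 100.0


# Made-up values for illustration only: a drop from 50% to 38% WER
# is a gain of 12 absolute points (a 24% relative reduction).
gain = absolute_point_gain(0.50, 0.38)
```

Reporting absolute points keeps results comparable across tiers whose baseline error rates differ widely, which matters for a benchmark that spans studio and spontaneous speech.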

Abstract

Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across four tiers: studio, broadcast, spontaneous, and synthetic noise. Through a controlled study of learning-rate timing and curriculum ordering, we find that early large parameter updates improve global WER by 12 absolute points, while a hard-to-easy curriculum adds gains for spontaneous speech. These findings motivate reverse multi-stage fine-tuning (R-MFT), a training recipe that enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M counterparts. Representational analysis via CKA and SVD reveals effective schedules concentrate adaptation in the…
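The abstract names CKA as one of its representational analysis tools but does not say which variant is used. As a minimal sketch, assuming the widely used linear form of centered kernel alignment, the similarity between two sets of layer activations (rows are examples, columns are features) can be computed as below. The function names are hypothetical, and the pure-Python implementation is chosen only to keep the example self-contained; a practical analysis would use an array library.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean so every feature has zero mean."""
    n_rows = len(matrix)
    means = [sum(column) / n_rows for column in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices with the same row count."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of examples")
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(yc, xc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    if denominator == 0.0:
        raise ValueError("activations must not be constant across examples")
    return numerator / denominator


# Toy activations: 4 examples, 2 features (values are illustrative only).
layer_before = [[1.0, 2.0], [2.0, 0.5], [3.0, 4.0], [4.0, 1.0]]
layer_scaled = [[2.0 * v for v in row] for row in layer_before]
similarity = linear_cka(layer_before, layer_scaled)  # close to 1.0
```

Linear CKA is invariant to isotropic scaling and orthogonal transformations of the features, so comparing the same layer before and after fine-tuning yields a value near 1 when the representation is largely unchanged and a lower value when adaptation has reshaped it.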
