CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition

Martijn Bartelds; Ananjan Nandi; Moussa Koulako Bala Doumbouya; Dan Jurafsky; Tatsunori Hashimoto; Karen Livescu

arXiv:2502.01777 · cs.LG · January 29, 2026

Open Access · 10 Models · 3 Reviews

TL;DR

This paper introduces CTC-DRO, a robust optimization method designed to improve multilingual speech recognition by reducing language disparities and addressing limitations of existing group DRO approaches.

Contribution

The paper proposes CTC-DRO, a novel method that smooths group weight updates and uses input length-matched batching to enhance robustness in speech recognition models.
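As a rough illustration of input length-matched batching, the sketch below packs utterances into batches whose total audio duration stays at or below a shared target, so that a summed CTC loss is computed over a comparable amount of audio in each batch. The function name `length_matched_batches`, the `(utt_id, group, duration)` tuple layout, the greedy packing rule, and the choice to keep each batch within a single language group are assumptions made for this example, not details taken from the paper.

```python
from collections import defaultdict


def length_matched_batches(utterances, target_seconds):
    """Greedily pack utterances into single-group batches of bounded duration.

    utterances: iterable of (utt_id, group, duration_seconds) tuples
                (hypothetical layout chosen for this sketch).
    Returns a list of (group, [utt_ids], total_seconds) tuples.
    """
    # Assumption: batches never mix groups, so each batch loss belongs to one group.
    by_group = defaultdict(list)
    for utt_id, group, duration in utterances:
        by_group[group].append((utt_id, duration))

    batches = []
    for group, items in by_group.items():
        current, total = [], 0.0
        for utt_id, duration in items:
            # Close the batch when adding this utterance would exceed the target.
            if current and total + duration > target_seconds:
                batches.append((group, current, total))
                current, total = [], 0.0
            current.append(utt_id)
            total += duration
        if current:
            batches.append((group, current, total))
    return batches


# Toy data: two hypothetical language groups with durations in seconds.
toy = [
    ("a1", "nl", 4.0), ("a2", "nl", 5.0), ("a3", "nl", 3.0),
    ("b1", "sw", 10.0), ("b2", "sw", 2.0),
]
batches = length_matched_batches(toy, target_seconds=10.0)
```

A single utterance longer than the target still forms its own batch, and the last batch of each group may fall well short of the target; a practical pipeline would likely sort or bucket by duration first.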

Findings

01

CTC-DRO reduces worst-language error by up to 47.1%.

02

It outperforms group DRO and baseline models in multilingual ASR.

03

Minimal additional computational costs are required.

Abstract

Modern deep learning models often achieve high overall performance, but consistently fail on specific subgroups. Group distributionally robust optimization (group DRO) addresses this problem by minimizing the worst-group loss, but it fails when group losses misrepresent performance differences between groups. This is common in domains like speech, where the widely used connectionist temporal classification (CTC) loss not only scales with input length but also varies with linguistic and acoustic properties, leading to spurious differences between group losses. We present CTC-DRO, which addresses the shortcomings of the group DRO objective by smoothing the group weight update to prevent overemphasis on consistently high-loss groups, while using input length-matched batching to mitigate CTC's scaling issues. We evaluate CTC-DRO on the task of multilingual automatic speech recognition (ASR)…
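The group weight update described above can be sketched on toy numbers. The code below implements the standard exponentiated-gradient update used in group DRO (each group weight is multiplied by the exponential of its scaled loss, then the weights are renormalized) alongside one possible smoothed variant. The smoothed form, which divides the exponent by `q_g + alpha`, is an assumption made for illustration, as are the function name and the parameter names `eta` and `alpha`; consult the paper for the exact CTC-DRO update.

```python
import math


def update_group_weights(q, losses, eta=0.1, alpha=None):
    """One exponentiated-gradient step on group weights, renormalized to sum to 1.

    alpha=None: standard group DRO step, q_g <- q_g * exp(eta * loss_g).
    alpha > 0:  assumed smoothed step, q_g <- q_g * exp(eta * loss_g / (q_g + alpha)),
                which shrinks the exponent for groups that already hold a large weight.
    """
    scaled = []
    for q_g, loss_g in zip(q, losses):
        denom = 1.0 if alpha is None else q_g + alpha
        scaled.append(q_g * math.exp(eta * loss_g / denom))
    total = sum(scaled)
    return [s / total for s in scaled]


# Toy setting: three groups with fixed losses. In real training the losses
# would change as the model is updated; they are held constant here only to
# isolate the behavior of the weight update.
losses = [1.0, 1.0, 2.0]
plain = [1 / 3, 1 / 3, 1 / 3]
smooth = [1 / 3, 1 / 3, 1 / 3]
for _ in range(500):
    plain = update_group_weights(plain, losses, eta=0.1)
    smooth = update_group_weights(smooth, losses, eta=0.1, alpha=0.5)
```

In this toy setting the unsmoothed update pushes nearly all weight onto the consistently high-loss third group, while the smoothed variant settles on a less concentrated distribution. The sketch only shows the qualitative effect of damping the update and does not reproduce any result from the paper.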

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The authors frame the problem clearly, pointing to a concrete issue in ASR, namely the interaction between CTC loss geometry and group DRO weight updates, with a neat derivation showing why standard group DRO weights can collapse onto one group. The solution is well motivated, sufficiently general, and theoretically principled.
- They present compelling empirical evidence on a credible benchmark (ML-SUPERB 2.0), with CTC-DRO showing consistent improvements on worst-group CER across five language sets and…

Weaknesses

- Robustness to the hyperparameter alpha: While the paper discusses alpha theoretically and reports practical ranges for it, it does not include a systematic sensitivity analysis. A more systematic hyperparameter sensitivity study (including ηq and the batch-duration target) and early/late-training stability plots across sets would increase confidence in robustness and tuning-free usability.
- The work is limited in scope by treating language as the sole grouping, relying on datasets where groups are well-d…

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The proposed algorithm appears to yield strong gains over baselines. Experiments are solid. The presentation is clear enough.

Weaknesses

The contribution is a bit incremental over an existing algorithm (Group DRO). Also, I have some questions about the compared baselines in this work, see the Questions section below.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper addresses language disparities in multilingual ASR, which is a critical problem.
2. The proposed modifications are simple, well-motivated, and easily adoptable in existing pipelines.
3. Results show consistent and sometimes substantial improvements over both CTC and Group DRO baselines, especially for worst-case languages.

Weaknesses

1. Experiments focus on small subsets (5–6 languages per set), with only limited scaling experiments. It is unclear how well the method generalizes to large-scale multilingual ASR (50–100+ languages).
2. The paper does not compare with alternative ASR+LID strategies, such as auxiliary CTC objectives [A] or condition-aware SSL representations [B], which directly integrate language identification and also improve low-resource language performance.
3. Although an ablation study is included, it…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques