# OLMoASR: Open Models and Data for Training Robust Speech Recognition Models

**Authors:** Huong Ngo, Matt Deitke, Martijn Bartelds, Sarah Pratt, Josh Gardner, Matt Jordan, Ludwig Schmidt

arXiv: 2508.20869 · 2025-08-29

## TL;DR

This paper introduces OLMoASR-Pool, a large-scale dataset of English audio and transcripts, and OLMoASR, a series of models for robust zero-shot speech recognition that performs comparably to OpenAI's Whisper on short and long-form benchmarks.

## Contribution

The paper presents a new high-quality dataset, OLMoASR-Mix, and a suite of models trained on it, advancing zero-shot speech recognition capabilities.

## Key findings

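- OLMoASR models achieve average performance comparable to OpenAI's Whisper at every model scale, from 39M (tiny.en) to 1.5B (large.en) parameters.
- Training on OLMoASR-Mix, 1M hours of audio-transcript pairs filtered from the 3M-hour OLMoASR-Pool with text heuristics, yields robust zero-shot speech recognition.
- OLMoASR-medium.en reaches 12.8% WER on short-form and 11.0% WER on long-form benchmarks, against 12.4% and 10.5% for Whisper-medium.en.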

## Abstract

Improvements in training data scale and quality have led to significant advances, yet their influence in speech recognition remains underexplored. In this paper, we present a large-scale dataset, OLMoASR-Pool, and a series of models, OLMoASR, to study and develop robust zero-shot speech recognition models. Beginning from OLMoASR-Pool, a collection of 3M hours of English audio and 17M transcripts, we design text heuristic filters to remove low-quality or mistranscribed data. Our curation pipeline produces a new dataset containing 1M hours of high-quality audio-transcript pairs, which we call OLMoASR-Mix. We use OLMoASR-Mix to train the OLMoASR suite of models, ranging from 39M (tiny.en) to 1.5B (large.en) parameters. Across all model scales, OLMoASR achieves comparable average performance to OpenAI's Whisper on short and long-form speech recognition benchmarks. Notably, OLMoASR-medium.en attains word error rates (WER) of 12.8% and 11.0% for short and long-form recognition respectively, on par with the 12.4% and 10.5% WER of Whisper-medium.en, Whisper's largest English-only model, at equivalent parameter count. OLMoASR-Pool, OLMoASR models, and filtering, training and evaluation code will be made publicly available to further research on robust speech processing.
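
The results above are reported as word error rate (WER): the number of word-level substitutions, deletions, and insertions needed to turn a model's transcript into the reference, divided by the number of reference words. As a minimal sketch of the metric itself, the function below computes WER for a single utterance using whitespace tokenization. The function name `word_error_rate` is hypothetical and not taken from the paper's code, and the text normalization (casing, punctuation) that a benchmark evaluation would normally apply beforehand is assumed to have been done already.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenizes on whitespace and applies no text
    normalization, which a real evaluation pipeline would typically do first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # reference word deleted
                curr[j - 1] + 1,                  # hypothesis word inserted
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(f"WER: {word_error_rate(ref, hyp):.3f}")
```

Under this definition, a reported 12.8% WER corresponds to roughly 13 word errors per 100 reference words. Benchmark figures are usually aggregated over a whole test set rather than averaged per utterance, which this single-utterance sketch does not cover.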

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20869/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20869/full.md

## References

44 references — full list in the complete paper: https://tomesphere.com/paper/2508.20869/full.md

---
Source: https://tomesphere.com/paper/2508.20869