WhisperKit: On-device Real-time ASR with Billion-Scale Transformers

Atila Orhon; Arda Okan; Berkin Durmus; Zach Nagengast; Eduardo Pacheco

arXiv:2507.10860·cs.SD·July 16, 2025

TL;DR

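WhisperKit is an on-device real-time ASR system that matches the lowest latency (0.46s) and achieves the highest accuracy (2.2% WER) among the leading cloud-based systems it is benchmarked against, making it suitable for commercial applications such as live captioning, dictation, meeting transcription, and medical scribes.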

Contribution

The paper introduces WhisperKit, an optimized on-device inference system for real-time ASR with significant performance improvements over existing cloud-based solutions.

Findings

01

WhisperKit achieves 0.46s latency and 2.2% WER.

02

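It matches the lowest latency and achieves the highest accuracy among the benchmarked cloud-based systems (OpenAI gpt-4o-transcribe, Deepgram nova-3, and Fireworks large-v3-turbo).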

03

The system is suitable for real-time commercial applications.

Abstract

Real-time Automatic Speech Recognition (ASR) is a fundamental building block for many commercial applications of ML, including live captioning, dictation, meeting transcription, and medical scribes. Accuracy and latency are the most important factors when companies select a system to deploy. We present WhisperKit, an optimized on-device inference system for real-time ASR that significantly outperforms leading cloud-based systems. We benchmark against server-side systems that deploy a diverse set of models, including a frontier model (OpenAI gpt-4o-transcribe), a proprietary model (Deepgram nova-3), and an open-source model (Fireworks large-v3-turbo). Our results show that WhisperKit matches the lowest latency at 0.46s while achieving the highest accuracy at 2.2% WER. The optimizations behind the WhisperKit system are described in detail in this paper.
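
For readers unfamiliar with the accuracy metric quoted above, the following is a minimal sketch of word error rate (WER) under its standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not describe the text normalization or scoring tool the authors used, so this should not be read as their evaluation pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no casing or punctuation normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a WER of 2.2% corresponds to roughly 2.2 word errors per 100 reference words.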

Taxonomy

Topics: EEG and Brain-Computer Interfaces · Advanced Memory and Neural Computing · Blind Source Separation Techniques

Methods: Sparse Evolutionary Training