Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference

Nenad Banfic; David Fan; Kunal Vaishnavi; Sam Kemp; Sunghoon Choi; Rui Ren; Sayan Shaw; Meng Tang

arXiv:2604.14493 · cs.AI · April 21, 2026

TL;DR

This paper benchmarks over 50 on-device ASR configurations, selects the strongest streaming model, and quantizes it from 2.47 GB to 0.67 GB. The int4 k-quant variant reaches 8.20% WER and runs faster than real time on CPU.

Contribution

The paper identifies NVIDIA Nemotron Speech Streaming as the strongest architecture for real-time English ASR on resource-constrained hardware, and shows that post-training quantization can substantially reduce its model size.

Findings

01. NVIDIA Nemotron Speech Streaming outperforms other architectures in resource-limited settings.

02. Quantization reduces model size from 2.47 GB to 0.67 GB with minimal WER impact.

03. The int4 k-quant model achieves 8.20% WER and runs faster than real-time on CPU.
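
The findings above rest on three quantities: word error rate (WER), the real-time factor behind "faster than real-time", and the size reduction from quantization. The sketch below shows how each is conventionally computed. It is a minimal illustration and not the paper's evaluation code; the function names are hypothetical, and it assumes transcripts are already normalized and whitespace-tokenized.

```python
from typing import Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: both strings are already normalized (case, punctuation).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means audio is processed faster than it is spoken."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def size_reduction(original_gb: float, quantized_gb: float) -> Tuple[float, float]:
    """Return (compression ratio, fraction of the original size removed)."""
    return original_gb / quantized_gb, 1.0 - quantized_gb / original_gb


if __name__ == "__main__":
    ratio, saved = size_reduction(2.47, 0.67)  # sizes reported in the findings
    print(f"compression: {ratio:.2f}x, size removed: {saved:.1%}")
```

Applied to the reported sizes, `size_reduction(2.47, 0.67)` gives a ratio of about 3.7x, or roughly 73% of the original footprint removed.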

Abstract

Deploying high-quality automatic speech recognition (ASR) on edge devices requires models that jointly optimize accuracy, latency, and memory footprint while operating entirely on CPU without GPU acceleration. We conduct a systematic empirical study of state-of-the-art ASR architectures, encompassing encoder-decoder, transducer, and LLM-based paradigms, evaluated across batch, chunked, and streaming inference modes. Through a comprehensive benchmark of over 50 configurations spanning OpenAI Whisper, NVIDIA Nemotron, Parakeet TDT, Canary, Conformer Transducer, and Qwen3-ASR, we identify NVIDIA's Nemotron Speech Streaming as the strongest candidate for real-time English streaming on resource-constrained hardware. We then re-implement the complete streaming inference pipeline in ONNX Runtime and conduct a controlled evaluation of multiple post-training quantization strategies, including…
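
To give a rough sense of what post-training weight quantization does, the sketch below applies symmetric block-wise int4 quantization to a list of weights. It is a generic textbook illustration and not one of the strategies evaluated in the paper (the excerpt above does not list them, and it is not the k-quant scheme). The function names and the block size of 32 are assumptions chosen for the example.

```python
from typing import List, Tuple


def quantize_int4_blockwise(
    weights: List[float], block_size: int = 32
) -> Tuple[List[int], List[float]]:
    """Map floats to integer codes in [-8, 7] with one scale per block.

    Hypothetical helper: symmetric quantization, scale = max|w| / 7.
    """
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # An all-zero block gets a dummy scale to avoid division by zero.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in block:
            q = round(w / scale)
            codes.append(max(-8, min(7, q)))  # clamp to the signed 4-bit range
    return codes, scales


def dequantize_int4_blockwise(
    codes: List[int], scales: List[float], block_size: int = 32
) -> List[float]:
    """Reconstruct approximate floats from codes and per-block scales."""
    return [q * scales[i // block_size] for i, q in enumerate(codes)]


if __name__ == "__main__":
    w = [0.7, -0.35, 0.1, 0.0]
    codes, scales = quantize_int4_blockwise(w, block_size=2)
    restored = dequantize_int4_blockwise(codes, scales, block_size=2)
    worst = max(abs(a - b) for a, b in zip(w, restored))
    print(f"codes={codes}, worst absolute error={worst:.4f}")
```

Each weight is stored in 4 bits plus a shared per-block scale, which is where the memory saving comes from. The reconstruction error for each weight is bounded by half of its block's scale, so smaller blocks trade extra scale storage for lower error.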
