WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

Erfan Ramezani; Mohammad Mahdi Giahi; Mohammad Erfan Zarabadipour; Amir Reza Yosefian; Hamid Ghadiri

arXiv:2604.25611·cs.CL·April 29, 2026

TL;DR

WhisperPipe is a streaming ASR architecture that reduces memory and latency while maintaining high transcription accuracy, enabling efficient real-time speech recognition on resource-constrained devices.

Contribution

WhisperPipe combines a hybrid VAD pipeline, dynamic buffering with overlapping context windows, and adaptive processing to bound memory use and keep latency low without sacrificing accuracy.

Findings

01

Median latency of 89 ms with 48% less GPU memory usage.

02

Maintains stable memory over 150-minute continuous operation.

03

Achieves near-offline Whisper accuracy with 3-5x lower latency.

Abstract

Real-time automatic speech recognition (ASR) systems face a fundamental trade-off between transcription accuracy and computational efficiency, particularly when deploying large-scale transformer models like Whisper. Existing streaming approaches either sacrifice accuracy through aggressive chunking or incur prohibitive memory costs through unbounded context accumulation. We present WhisperPipe, a novel streaming architecture that achieves bounded memory consumption while maintaining transcription quality through three key innovations: (1) a hybrid Voice Activity Detection (VAD) pipeline combining Silero VAD with energy-based filtering to reduce false activations by 34%, (2) a dynamic buffering mechanism with overlapping context windows that prevents information loss at segment boundaries, and (3) an adaptive processing strategy that balances latency and accuracy based on speech characteristics.…
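
The abstract names two mechanisms that are easy to picture in code: gating a neural VAD score with an energy check, and buffering audio in overlapping windows so that memory stays bounded. The Python sketch below illustrates the general idea only and is not the authors' implementation. Every name in it (`OverlapBuffer`, `is_speech`, `prob_threshold`, `energy_threshold`) is hypothetical, the thresholds are arbitrary, and combining the two VAD signals with a logical AND is an assumption, since the abstract does not say how they are combined. The `vad_prob` argument stands in for a speech probability produced by a model such as Silero VAD.

```python
import math
from collections import deque
from itertools import islice


def rms(frame):
    """Root-mean-square energy of a frame of float samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame)) if frame else 0.0


def is_speech(frame, vad_prob, prob_threshold=0.5, energy_threshold=0.01):
    """Hypothetical hybrid gate: accept a frame only if the neural VAD score
    AND the frame energy both clear their thresholds (assumed combination)."""
    return vad_prob >= prob_threshold and rms(frame) >= energy_threshold


class OverlapBuffer:
    """Sample buffer that emits fixed-size windows sharing `overlap` samples.

    After each push() returns, fewer than `window` samples remain buffered,
    so memory does not grow with the length of the stream.
    """

    def __init__(self, window, overlap):
        if not 0 <= overlap < window:
            raise ValueError("overlap must satisfy 0 <= overlap < window")
        self.window = window
        self.hop = window - overlap  # samples discarded after each window
        self.samples = deque()

    def push(self, chunk):
        """Append new samples and return every complete window now available."""
        self.samples.extend(chunk)
        windows = []
        while len(self.samples) >= self.window:
            windows.append(list(islice(self.samples, self.window)))
            for _ in range(self.hop):
                self.samples.popleft()
        return windows


buf = OverlapBuffer(window=4, overlap=2)
first = buf.push([1, 2, 3, 4, 5, 6, 7])
second = buf.push([8])
print(first, second)
```

With a window of 4 samples and an overlap of 2, consecutive windows share their last and first two samples, which is the property that keeps words at a segment boundary visible to both segments. In a real pipeline each emitted window would be passed to the recognizer only when the gate reports speech.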
