WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition
Erfan Ramezani, Mohammad Mahdi Giahi, Mohammad Erfan Zarabadipour, Amir Reza Yosefian, Hamid Ghadiri

TL;DR
WhisperPipe is a streaming ASR architecture that reduces memory and latency while maintaining high transcription accuracy, enabling efficient real-time speech recognition on resource-constrained devices.
Contribution
It introduces a hybrid VAD, dynamic buffering, and adaptive processing to achieve bounded memory use and low latency without sacrificing accuracy.
Findings
Median latency of 89 ms with 48% less GPU memory usage.
Maintains stable memory over 150 minutes of continuous operation.
Achieves near-offline Whisper accuracy with 3-5x lower latency.
Abstract
Real-time automatic speech recognition (ASR) systems face a fundamental trade-off between transcription accuracy and computational efficiency, particularly when deploying large-scale transformer models like Whisper. Existing streaming approaches either sacrifice accuracy through aggressive chunking or incur prohibitive memory costs through unbounded context accumulation. We present WhisperPipe, a novel streaming architecture that achieves bounded memory consumption while maintaining transcription quality through three key innovations: (1) a hybrid Voice Activity Detection (VAD) pipeline that combines Silero VAD with energy-based filtering to reduce false activations by 34%; (2) a dynamic buffering mechanism with overlapping context windows that prevents information loss at segment boundaries; and (3) an adaptive processing strategy that balances latency and accuracy based on speech characteristics.…
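To make the first two ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the hybrid VAD accepts a frame only when both an energy check and a model-based speech probability pass; the abstract does not say how the two signals are combined. The names `rms_energy`, `is_speech`, `OverlapBuffer`, and the `speech_prob` callable (a stand-in for a model such as Silero VAD) are hypothetical, as are both threshold values.

```python
import math
from collections import deque
from typing import Callable, Deque, List, Sequence


def rms_energy(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples assumed in [-1, 1])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_speech(
    frame: Sequence[float],
    speech_prob: Callable[[Sequence[float]], float],
    prob_threshold: float = 0.5,     # assumed value, not taken from the paper
    energy_threshold: float = 0.01,  # assumed value, not taken from the paper
) -> bool:
    """Hybrid gate (assumed AND logic): energy check first, then the model score."""
    if rms_energy(frame) < energy_threshold:
        return False  # near-silent frame: reject without calling the model
    return speech_prob(frame) >= prob_threshold


class OverlapBuffer:
    """Sample buffer that emits fixed-size windows sharing `overlap` samples.

    After each push, fewer than `window` samples remain buffered, so memory
    stays bounded no matter how long the stream runs.
    """

    def __init__(self, window: int, overlap: int) -> None:
        if not 0 <= overlap < window:
            raise ValueError("overlap must satisfy 0 <= overlap < window")
        self.window = window
        self.overlap = overlap
        self._buf: Deque[float] = deque()

    def push(self, samples: Sequence[float]) -> List[List[float]]:
        """Append samples and return every complete window now available."""
        self._buf.extend(samples)
        windows: List[List[float]] = []
        while len(self._buf) >= self.window:
            windows.append([self._buf[i] for i in range(self.window)])
            # Drop only the non-overlapping part; the tail seeds the next window.
            for _ in range(self.window - self.overlap):
                self._buf.popleft()
        return windows

    def __len__(self) -> int:
        return len(self._buf)
```

In use, frames that pass `is_speech` would be pushed into an `OverlapBuffer`, and each returned window would be handed to the recognizer. Because consecutive windows share `overlap` samples, a word that straddles a boundary appears whole in at least one window, provided it is no longer than the overlap.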
