Whispy: Adapting STT Whisper Models to Real-Time Environments
Antonio Bevilacqua, Paolo Saviano, Alessandro Amirante, Simon Pietro Romano

TL;DR
Whispy is a system that adapts Whisper models for real-time speech transcription, achieving low latency and high accuracy through architectural optimizations, enabling practical live speech analysis applications.
Contribution
This paper introduces Whispy, a novel system that enables real-time transcription with pretrained Whisper models by optimizing the architecture for low latency and high accuracy.
Findings
Whispy maintains high transcription accuracy in real-time settings.
The system demonstrates robustness across diverse speech datasets.
Whispy achieves low computational cost suitable for practical deployment.
Abstract
Large general-purpose transformer models have recently become the mainstay in the realm of speech analysis. In particular, Whisper achieves state-of-the-art results in relevant tasks such as speech recognition, translation, language identification, and voice activity detection. However, Whisper models are not designed to be used in real-time conditions, and this limitation makes them unsuitable for a vast plethora of practical applications. In this paper, we introduce Whispy, a system intended to bring live capabilities to the Whisper pretrained models. As a result of a number of architectural optimisations, Whispy is able to consume live audio streams and generate high level, coherent voice transcriptions, while still maintaining a low computational cost. We evaluate the performance of our system on a large repository of publicly available speech datasets, investigating how the…
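The abstract does not describe how Whispy buffers or schedules audio, so the following is only a minimal sketch of one generic way a batch-oriented transcriber could be fed from a live stream: keep a rolling window of the most recent chunks and re-transcribe it whenever a new chunk arrives. The names `stream_transcribe`, `window_chunks`, and `fake_transcribe` are hypothetical, and the transcriber is a stand-in stub rather than a Whisper call.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List

Chunk = List[float]  # one block of audio samples (assumed mono floats)


def stream_transcribe(
    chunks: Iterable[Chunk],
    transcribe: Callable[[List[float]], str],
    window_chunks: int = 3,
) -> Iterator[str]:
    """Yield one transcript per incoming chunk, computed over a rolling window.

    Illustrative only: this is not Whispy's actual design.
    """
    # The deque drops the oldest chunk once the window is full.
    window: deque = deque(maxlen=window_chunks)
    for chunk in chunks:
        window.append(chunk)
        # Flatten the buffered chunks into one contiguous sample list.
        samples = [s for c in window for s in c]
        yield transcribe(samples)


def fake_transcribe(samples: List[float]) -> str:
    """Hypothetical stub standing in for a real speech-to-text model."""
    return f"{len(samples)} samples"


if __name__ == "__main__":
    live_audio = [[0.0] * 4 for _ in range(5)]  # five chunks of four samples
    for text in stream_transcribe(live_audio, fake_transcribe):
        print(text)
```

A larger window gives the model more context per call at the cost of more computation per update, which is the kind of latency versus cost trade-off a live system has to balance.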
Taxonomy
Topics: Ethics in Business and Education
