Exploring RWKV for Memory Efficient and Low Latency Streaming ASR
Keyu An, Shiliang Zhang

TL;DR
This paper applies RWKV, a linear attention transformer variant, to streaming ASR and reports accuracy comparable to or better than chunk conformer transducers, with lower latency and inference memory cost.
Contribution
The paper applies RWKV, an existing linear attention transformer variant, to streaming ASR. RWKV pairs transformer-level performance with RNN-like inference efficiency, which suits streaming scenarios with tight latency and memory budgets.
Findings
RWKV-Transducer and RWKV-Boundary-Aware-Transducer achieve accuracy comparable to, or better than, the chunk conformer transducer.
Both RWKV models do so with minimal latency and inference memory cost.
Experiments cover data scales from 100 h to 10,000 h.
Abstract
Recently, self-attention-based transformers and conformers have been introduced as alternatives to RNNs for ASR acoustic modeling. Nevertheless, the full-sequence attention mechanism is non-streamable and computationally expensive, thus requiring modifications, such as chunking and caching, for efficient streaming ASR. In this paper, we propose to apply RWKV, a variant of linear attention transformer, to streaming ASR. RWKV combines the superior performance of transformers and the inference efficiency of RNNs, which is well-suited for streaming ASR scenarios where the budget for latency and memory is restricted. Experiments on varying scales (100h - 10000h) demonstrate that RWKV-Transducer and RWKV-Boundary-Aware-Transducer achieve accuracy comparable to or even better than that of the chunk conformer transducer, with minimal latency and inference memory cost.
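The abstract does not spell out how RWKV gets RNN-like inference cost, so the following is only a minimal sketch under stated assumptions. It assumes the RWKV-4-style WKV formulation from the original RWKV paper, reduced to a single channel; the function names `wkv_streaming` and `wkv_full`, the scalar inputs, and the parameter names `w` (decay) and `u` (current-frame bonus) are illustrative and not taken from this paper. It shows that the per-frame output can be computed from a fixed-size state of two running sums, so memory does not grow with the length of the audio, and that this matches a full-history recomputation.

```python
import math


def wkv_streaming(keys, values, w, u):
    """Yield WKV outputs one frame at a time for a single channel.

    keys, values: per-frame scalars k_t and v_t (hypothetical inputs).
    w: positive decay rate; u: bonus applied to the current frame.
    Assumes the RWKV-4-style WKV formula; not this paper's exact code.
    """
    num = 0.0  # decayed running sum of exp(k_i) * v_i over past frames
    den = 0.0  # decayed running sum of exp(k_i) over past frames
    decay = math.exp(-w)
    for k, v in zip(keys, values):
        cur = math.exp(u + k)
        # Output depends only on the fixed-size state and the current frame.
        yield (num + cur * v) / (den + cur)
        # Fold the current frame into the state for the next step.
        num = decay * num + math.exp(k) * v
        den = decay * den + math.exp(k)


def wkv_full(keys, values, w, u):
    """Reference: recompute each output from the whole history, O(T^2)."""
    outs = []
    for t in range(len(keys)):
        num = math.exp(u + keys[t]) * values[t]
        den = math.exp(u + keys[t])
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + keys[i])
            num += weight * values[i]
            den += weight
        outs.append(num / den)
    return outs
```

Calling `list(wkv_streaming(keys, values, w, u))` and `wkv_full(keys, values, w, u)` on the same inputs should agree up to floating-point error. This sketch omits the running-maximum trick that practical implementations typically use to keep the exponentials numerically stable, so it is only suitable for small key values.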
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Underwater Acoustics Research
