Benchmarking Rotary Position Embeddings for Automatic Speech Recognition
Shucong Zhang, Titouan Parcollet, Rogier van Dalen, Sourav Bhattacharya

TL;DR
This paper evaluates Rotary Positional Embeddings (RoPE) in automatic speech recognition, demonstrating comparable or better accuracy than relative position embeddings with reduced training time across diverse speech tasks.
Contribution
It provides the first comprehensive assessment of RoPE in ASR, showing that RoPE matches or improves on the error rates of relative position embeddings (RelPos) while reducing training time.
Findings
RoPE achieves similar or better error rates than RelPos in ASR tasks.
Training time is reduced by up to 21% using RoPE.
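RoPE is effective across read, spontaneous, clean, and noisy speech, across different accents, and in both streaming and non-streaming settings, with training data ranging from 100 to 50,000 hours.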
Abstract
Self-attention relies on positional embeddings to encode input order. Relative Position (RelPos) embeddings are widely used in Automatic Speech Recognition (ASR). However, RelPos has quadratic time complexity to input length and is often incompatible with fast GPU implementations of attention. In contrast, Rotary Positional Embedding (RoPE) rotates each input vector based on its absolute position, taking linear time to sequence length, implicitly encoding relative distances through self-attention dot products. Thus, it is usually compatible with efficient attention. However, its use in ASR remains underexplored. This work evaluates RoPE across diverse ASR tasks with training data ranging from 100 to 50,000 hours, covering various speech types (read, spontaneous, clean, noisy) and different accents in both streaming and non-streaming settings. ASR error rates are similar or better than…
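The rotation mechanism described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `rope`, the frequency base of 10000, and the interleaved pairing of feature dimensions are assumptions, following a common RoPE formulation. The sketch rotates each vector by angles proportional to its absolute position in a single linear pass, then checks that the resulting dot products depend only on the distance between positions.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each row of x (shape [seq_len, dim], dim even) by angles
    proportional to the row's absolute position.
    `base` and the interleaved pairing are assumptions, not from the paper."""
    seq_len, dim = x.shape
    # One rotation frequency per pair of feature dimensions.
    freqs = base ** (-np.arange(0, dim, 2) / dim)          # [dim/2]
    angles = np.arange(seq_len)[:, None] * freqs[None, :]  # [seq_len, dim/2]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=float)
    # Standard 2D rotation applied to each (even, odd) feature pair.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Place the same query and key vector at every position, so any variation
# in the attention scores comes from position alone.
rng = np.random.default_rng(0)
q = np.tile(rng.standard_normal(8), (6, 1))
k = np.tile(rng.standard_normal(8), (6, 1))
scores = rope(q) @ rope(k).T

# Pairs (2, 0) and (5, 3) share the same offset, so their scores agree.
print(np.allclose(scores[2, 0], scores[5, 3]))  # True
```

Because the rotation is applied to queries and keys before the dot product, no pairwise position table is needed, which is why this style of embedding can usually be combined with efficient attention kernels.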
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
