Benchmarking Rotary Position Embeddings for Automatic Speech Recognition

Shucong Zhang; Titouan Parcollet; Rogier van Dalen; Sourav Bhattacharya

arXiv:2501.06051 · cs.CL · June 17, 2025

TL;DR

This paper evaluates Rotary Positional Embeddings (RoPE) in automatic speech recognition, demonstrating comparable or better accuracy than relative position embeddings with reduced training time across diverse speech tasks.

Contribution

The paper provides the first comprehensive assessment of RoPE in ASR, showing its efficiency and effectiveness compared to traditional relative position embeddings.

Findings

01. RoPE achieves similar or better error rates than RelPos in ASR tasks.

02. Training time is reduced by up to 21% using RoPE.

03. RoPE is effective across read, spontaneous, clean, and noisy speech, across different accents, and in both streaming and non-streaming settings.

Abstract

Self-attention relies on positional embeddings to encode input order. Relative Position (RelPos) embeddings are widely used in Automatic Speech Recognition (ASR). However, RelPos has time complexity quadratic in the input length and is often incompatible with fast GPU implementations of attention. In contrast, Rotary Positional Embedding (RoPE) rotates each input vector based on its absolute position, which takes time linear in the sequence length, and implicitly encodes relative distances through the self-attention dot products. It is therefore usually compatible with efficient attention implementations. However, its use in ASR remains underexplored. This work evaluates RoPE across diverse ASR tasks with training data ranging from 100 to 50,000 hours, covering various speech types (read, spontaneous, clean, noisy) and different accents in both streaming and non-streaming settings. ASR error rates are similar or better than…
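
The abstract's description of RoPE (rotate each vector by its absolute position, and let the attention dot product recover the relative distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `rope_rotate` and `dot`, the toy vectors, and the frequency base of 10000 are assumptions taken from the commonly used RoPE formulation, not details confirmed on this page.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `position`.

    Hypothetical helper: `base` follows the commonly used RoPE default and is
    an assumption, not a setting reported on this page.
    """
    dim = len(vec)
    assert dim % 2 == 0, "RoPE pairs up dimensions, so dim must be even"
    out = []
    for i in range(0, dim, 2):
        # One rotation frequency per 2-D pair; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))


# Toy query/key vectors (illustrative values only).
query = [0.3, -1.2, 0.7, 0.5]
key = [1.1, 0.4, -0.6, 0.9]

# Both position pairs are 3 steps apart, at very different absolute offsets.
score_near = dot(rope_rotate(query, 5), rope_rotate(key, 2))
score_far = dot(rope_rotate(query, 105), rope_rotate(key, 102))

print(math.isclose(score_near, score_far, abs_tol=1e-9))
```

Each vector is rotated independently, so the cost is linear in sequence length, and the score depends only on the distance between the two positions. Because the rotation is applied to queries and keys before the dot product, the attention computation itself is unchanged, which is consistent with the abstract's point about compatibility with efficient attention.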

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing