A Circular Argument: Does RoPE need to be Equivariant for Vision?
Chase van de Geijn, Timo Lüddecke, Polina Turishcheva, Alexander S. Ecker

TL;DR
This paper investigates the importance of equivariance in Rotary Positional Encodings for vision, showing that non-equivariant variants can perform equally well or better and challenging the assumption that RoPE's success stems from its equivariance.
Contribution
It introduces Spherical RoPE, a non-equivariant positional encoding, and demonstrates its effectiveness, questioning the necessity of equivariance in vision tasks.
Findings
Spherical RoPE matches or outperforms equivariant variants in vision tasks.
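Mathematical analysis shows RoPE to be one of the most general solutions for equivariant positional embedding of one-dimensional data, and Mixed RoPE to be the analogous solution for M-dimensional data when the generators commute.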
Non-equivariant encodings can be faster and more flexible for vision applications.
Abstract
Rotary Positional Encodings (RoPE) have emerged as a highly effective technique for one-dimensional sequences in Natural Language Processing, spurring recent progress towards generalizing RoPE to higher-dimensional data such as images and videos. The success of RoPE has been thought to be due to its positional equivariance, i.e. its status as a relative positional encoding. In this paper, we mathematically show RoPE to be one of the most general solutions for equivariant positional embedding in one-dimensional data. Moreover, we show Mixed RoPE to be the analogously general solution for M-dimensional data, if we require commutative generators -- a property necessary for RoPE's equivariance. However, we question whether strict equivariance plays a large role in RoPE's performance. We propose Spherical RoPE, a method analogous to Mixed RoPE, but one that assumes non-commutative generators.…
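The two properties the abstract relies on can be illustrated with a small sketch. The first part applies a standard one-dimensional RoPE rotation to complex-valued channels and checks that the attention score depends only on the relative offset between positions. The second part uses rotations about the x and y axes in 3D as an example of non-commuting generators and checks that the "relative" transform then changes when both positions are shifted by the same amount. All names here (`rope_1d`, `encode_2d`, the frequencies and test vectors) are hypothetical choices made for illustration; in particular, `encode_2d` is not the paper's Spherical RoPE construction, only a generic example of why commutativity is needed for equivariance.

```python
import cmath
import math


def rope_1d(vec, pos, freqs):
    """Rotate each complex channel of `vec` by the angle pos * freq."""
    return [z * cmath.exp(1j * f * pos) for z, f in zip(vec, freqs)]


def score(q, k):
    """Real part of the Hermitian inner product of two complex vectors."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))


# Arbitrary illustrative frequencies and query/key vectors (assumptions).
freqs = [1.0, 0.1]
q = [complex(0.3, -1.2), complex(0.7, 0.4)]
k = [complex(-0.5, 0.9), complex(1.1, 0.2)]

# Same offset (3) at two different absolute locations.
s_a = score(rope_1d(q, 5, freqs), rope_1d(k, 2, freqs))
s_b = score(rope_1d(q, 105, freqs), rope_1d(k, 102, freqs))
print("1D RoPE scores:", s_a, s_b)


def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]


def encode_2d(px, py):
    """Hypothetical 2D encoding built from non-commuting 3D rotations."""
    return matmul(rot_y(py), rot_x(px))


def relative(p, r):
    """Transform that enters the score: R(p)^T R(r)."""
    return matmul(transpose(encode_2d(*p)), encode_2d(*r))


# Same offset (1, 1), but the second pair is shifted by one unit along x.
rel_a = relative((0.0, 0.0), (1.0, 1.0))
rel_b = relative((1.0, 0.0), (2.0, 1.0))
max_diff = max(abs(rel_a[i][j] - rel_b[i][j])
               for i in range(3) for j in range(3))
print("max difference between relative transforms:", max_diff)
```

In the first part the two scores agree up to floating-point error, which is the relative (equivariant) behaviour. In the second part the two relative transforms differ, because rotations about different axes do not commute and the shift no longer cancels.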
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Face recognition and analysis
