Learning the RoPEs: Better 2D and 3D Position Encodings with STRING
Connor Schenck, Isaac Reid, Mithun George Jacob, Alex Bewley, Joshua Ainslie, David Rendleman, Deepali Jain, Mohit Sharma, Avinava Dubey, Ayzaan Wahid, Sumeet Singh, René Wagner, Tianli Ding, Chuyuan Fu, Arunkumar Byravan, Jake Varley, Alexey Gritsenko, Matthias Minderer

TL;DR
This paper introduces STRING, a position encoding method that extends Rotary Position Encodings (RoPE) to token coordinates of arbitrary dimensionality, including 2D and 3D. It keeps exact translation invariance at a low computational cost and shows substantial gains in vision and robotics applications.
Contribution
STRING provides a unifying theoretical framework for position encodings, enabling exact translation invariance in multiple dimensions while maintaining low computational costs.
Findings
Substantial gains in open-vocabulary object detection.
Improved performance in robotics controllers.
Theoretical proof of universality of STRING.
Abstract
We introduce STRING: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings, a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING still provides exact translation invariance, including token coordinates of arbitrary dimensionality, whilst maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Vision Transformers with RGB(-D) inputs (color plus optional depth), showing substantial gains, e.g. in open-vocabulary object detection and for robotics controllers. We complement our experiments with a rigorous mathematical analysis, proving the universality of our methods.
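The translation-invariance property mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's STRING implementation. It assumes a plain RoPE-style encoding in which each 2x2 block of a query or key is rotated by an angle that is linear in the token's coordinates. The names `rotary_matrix`, `score`, and `freqs` are hypothetical and chosen for illustration only. Under that assumption, the attention score depends only on the difference between the two positions, so shifting both tokens by the same offset leaves it unchanged.

```python
import numpy as np


def rotary_matrix(position, freqs):
    """Block-diagonal rotation for a token at a d-dimensional position.

    freqs has shape (num_blocks, d). Block b rotates by the angle
    freqs[b] @ position, which is linear in the coordinates (assumption).
    """
    position = np.asarray(position, dtype=float)
    angles = freqs @ position                      # shape: (num_blocks,)
    n = 2 * len(angles)
    R = np.zeros((n, n))
    for b, theta in enumerate(angles):
        c, s = np.cos(theta), np.sin(theta)
        R[2 * b:2 * b + 2, 2 * b:2 * b + 2] = [[c, -s], [s, c]]
    return R


rng = np.random.default_rng(0)
freqs = rng.normal(size=(4, 3))    # 4 rotation blocks, 3D token coordinates
q = rng.normal(size=8)             # illustrative query vector
k = rng.normal(size=8)             # illustrative key vector


def score(pos_q, pos_k):
    """Dot-product attention score between position-encoded q and k."""
    return (rotary_matrix(pos_q, freqs) @ q) @ (rotary_matrix(pos_k, freqs) @ k)


p_q = np.array([0.5, -1.0, 2.0])
p_k = np.array([1.5, 0.25, -0.75])
shift = np.array([10.0, -3.0, 7.0])

original = score(p_q, p_k)
shifted = score(p_q + shift, p_k + shift)

# The two scores agree up to floating-point error.
print(np.isclose(original, shifted))
```

The check works because the rotations commute and satisfy R(a)^T R(b) = R(b - a), so only the relative position b - a enters the score.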
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Constraint Satisfaction and Optimization
