TL;DR
Parabolic Position Encoding (PaPE) is a novel, vision-centric, parabola-based position encoding designed to improve extrapolation and generality across vision modalities.
Contribution
We introduce PaPE, a principled parabola-based position encoding that captures vision modality characteristics and demonstrates superior extrapolation and generality in vision tasks.
Findings
PaPE outperforms existing encodings in extrapolation on ImageNet-1K, improving by up to 10.5% in absolute terms over the next-best encoding.
PaPE matches or exceeds baseline performance across 8 datasets and 4 modalities.
PaPE's design incorporates translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness.
Abstract
We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens (such as from videos, event camera streams, images, or point clouds), our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D sequences in language to nD structures in vision, but with only a partial account of vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best encoding. Generality experiments on 8 datasets…
