Beyond flattening: a geometrically principled positional encoding for vision transformers with Weierstrass elliptic functions
Zhihang Xin, Xitong Hu, Rui Wang

TL;DR
This paper introduces WEF-PE, a geometrically grounded positional encoding for vision transformers using elliptic functions, improving spatial relationship modeling and overall performance in vision tasks.
Contribution
It presents a novel positional encoding based on Weierstrass elliptic functions that captures 2D spatial relationships more effectively than traditional methods.
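To make concrete what is lost when a 2D patch grid is reduced to a 1D sequence index, the following Python sketch compares Euclidean distance on the grid with index distance after row-major flattening. The 14x14 grid size and the helper names are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

def flat_index(row, col, grid_w):
    """Row-major index of a patch after flattening a 2D patch grid."""
    return row * grid_w + col

def distances(p, q, grid_w):
    """Return (Euclidean distance on the grid, index distance after flattening)."""
    euclid = math.hypot(p[0] - q[0], p[1] - q[1])
    index = abs(flat_index(*p, grid_w) - flat_index(*q, grid_w))
    return euclid, index

GRID_W = 14  # assumed grid width, e.g. a 224x224 image cut into 16x16 patches

# Two vertically adjacent patches: close in space, far apart in the sequence.
vertical_neighbors = distances((0, 0), (1, 0), GRID_W)

# End of row 0 versus start of row 1: far apart in space, adjacent in the sequence.
row_wrap = distances((0, 13), (1, 0), GRID_W)
```

In this toy setup the spatially adjacent pair ends up 14 positions apart in the sequence, while the pair that is adjacent in the sequence sits at opposite edges of the grid, so index distance does not grow monotonically with spatial distance.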
Findings
Achieves 63.78% accuracy on CIFAR-100 when ViT-Tiny is trained from scratch.
Improves performance on CIFAR-100 fine-tuning with ViT-Base.
Enhances results on VTAB-1k benchmark tasks.
Abstract
Vision Transformers have demonstrated remarkable success in computer vision tasks, yet their reliance on learnable one-dimensional positional embeddings fundamentally disrupts the inherent two-dimensional spatial structure of images through patch flattening procedures. Traditional positional encoding approaches lack geometric constraints and fail to establish monotonic correspondence between Euclidean spatial distances and sequential index distances, thereby limiting the model's capacity to leverage spatial proximity priors effectively. We propose Weierstrass Elliptic Function Positional Encoding (WEF-PE), a mathematically principled approach that directly addresses two-dimensional coordinates through natural complex domain representation, where the doubly periodic properties of elliptic functions align remarkably with translational invariance patterns commonly observed in visual data.…
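The idea of treating a patch's 2D coordinate as a complex number and passing it through a doubly periodic function can be sketched in plain Python. This is only an illustrative sketch, not the authors' implementation: the truncated lattice sum, the square lattice (periods 1 and i), the mapping of patch centers into the unit cell, and all function names are assumptions made here.

```python
def weierstrass_p(z, omega1=1.0 + 0j, omega2=1j, n_terms=30):
    """Truncated lattice-sum approximation of the Weierstrass elliptic function.

    p(z) = 1/z^2 + sum over nonzero lattice points w of [1/(z - w)^2 - 1/w^2],
    where w = m*omega1 + n*omega2 and |m|, |n| <= n_terms (assumed truncation).
    """
    total = 1.0 / (z * z)
    for m in range(-n_terms, n_terms + 1):
        for n in range(-n_terms, n_terms + 1):
            if m == 0 and n == 0:
                continue  # the w = 0 term is the leading 1/z^2 above
            w = m * omega1 + n * omega2
            total += 1.0 / ((z - w) ** 2) - 1.0 / (w * w)
    return total

def patch_to_complex(row, col, grid_h, grid_w, omega1=1.0 + 0j, omega2=1j):
    """Map a patch center to a point strictly inside the fundamental cell.

    Using patch centers (the +0.5 offset) keeps z away from the lattice
    points, where the function has poles.
    """
    u = (col + 0.5) / grid_w
    v = (row + 0.5) / grid_h
    return u * omega1 + v * omega2

def encode_patch(row, col, grid_h=14, grid_w=14):
    """Hypothetical 2-feature encoding: real and imaginary parts of p(z)."""
    value = weierstrass_p(patch_to_complex(row, col, grid_h, grid_w))
    return (value.real, value.imag)

features = encode_patch(3, 5)
```

Because the sum is truncated, the double periodicity holds only approximately, with the error shrinking as `n_terms` grows. A practical encoding would also need more than two features per patch and some treatment of the large values near the poles; both are left out of this sketch.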
