Proximal Vision Transformer: Enhancing Feature Representation through Two-Stage Manifold Geometry
Haoyu Yun, Hamid Krim

TL;DR
This paper introduces a framework that combines the Vision Transformer (ViT) with proximal tools to model global geometric relationships between data points, improving feature representation and classification accuracy in computer vision tasks.
Contribution
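The paper integrates ViT with proximal methods: self-attention constructs the tangent bundle of the manifold, with each attention head corresponding to a tangent space, and proximal iterations are then introduced for global geometric optimization of the features.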
Findings
Outperforms traditional ViT in classification accuracy
Enhances global feature alignment through tangent bundle modeling
Achieves a better representation of the data distribution
Abstract
The Vision Transformer (ViT) architecture has become widely recognized in computer vision, leveraging its self-attention mechanism to achieve remarkable success across various tasks. Despite its strengths, ViT's optimization remains confined to modeling local relationships within individual images, limiting its ability to capture the global geometric relationships between data points. To address this limitation, this paper proposes a novel framework that integrates ViT with proximal tools, enabling a unified geometric optimization approach to enhance feature representation and classification performance. In this framework, ViT constructs the tangent bundle of the manifold through its self-attention mechanism, where each attention head corresponds to a tangent space, offering geometric representations from diverse local perspectives. Proximal iterations are then introduced to define…
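For readers unfamiliar with proximal tools, the following is a minimal, generic sketch of a proximal iteration applied to per-head feature vectors. It is not the paper's method: the abstract is truncated before it states what the proximal iterations define, so the objective (a quadratic fit plus an L1 penalty), the function names `soft_threshold` and `proximal_gradient`, the array `head_features`, and all parameter values are assumptions chosen only to illustrate the general technique.

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam * ||x||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def proximal_gradient(z, lam=0.1, step=0.5, n_iter=50):
    """Minimize 0.5 * ||x - z||^2 + lam * ||x||_1 by proximal gradient steps.

    Assumed toy objective, used only to show the shape of a proximal iteration.
    """
    x = np.zeros_like(z)
    for _ in range(n_iter):
        grad = x - z                                       # gradient of the smooth term
        x = soft_threshold(x - step * grad, step * lam)    # prox of the nonsmooth term
    return x

# Hypothetical stand-in for per-head features: (num_heads, head_dim).
rng = np.random.default_rng(0)
head_features = rng.normal(size=(4, 8))
refined = proximal_gradient(head_features)
```

For this toy objective the minimizer has a closed form, `soft_threshold(head_features, 0.1)`, which gives a simple check that the iteration converges to the right point.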
