Proximal Vision Transformer: Enhancing Feature Representation through Two-Stage Manifold Geometry

Haoyu Yun; Hamid Krim

arXiv:2508.17081·cs.CV·August 26, 2025


TL;DR

This paper introduces a novel framework combining Vision Transformer with proximal tools to model global geometric relationships, improving feature representation and classification accuracy in computer vision tasks.

Contribution

It proposes integrating ViT with proximal methods to construct a tangent bundle for enhanced global geometric optimization of features.

Findings

01 Outperforms traditional ViT in classification accuracy

02 Enhances global feature alignment through tangent bundle modeling

03 Achieves better data distribution representation

Abstract

The Vision Transformer (ViT) architecture has become widely recognized in computer vision, leveraging its self-attention mechanism to achieve remarkable success across various tasks. Despite its strengths, ViT's optimization remains confined to modeling local relationships within individual images, which limits its ability to capture the global geometric relationships between data points. To address this limitation, this paper proposes a novel framework that integrates ViT with proximal tools, enabling a unified geometric optimization approach that enhances feature representation and classification performance. In this framework, ViT constructs the tangent bundle of the manifold through its self-attention mechanism: each attention head corresponds to a tangent space, offering geometric representations from diverse local perspectives. Proximal iterations are then introduced to define…
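
For readers unfamiliar with the term, the sketch below shows what a proximal iteration looks like in general. It is not the authors' method: the abstract is cut off before it states what the proximal iterations define, so the objective here (a least-squares term plus an L1 penalty), the soft-thresholding operator, and the names `soft_threshold` and `proximal_gradient` are all illustrative assumptions. Each iteration takes a gradient step on the smooth term and then applies the proximal operator of the non-smooth term.

```python
import numpy as np


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink every entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def proximal_gradient(A, b, lam, n_iter=200):
    """Minimise 0.5 * ||A x - b||^2 + lam * ||x||_1 by proximal gradient (ISTA).

    Illustrative objective only; the paper's actual objective is not
    recoverable from the truncated abstract.
    """
    # Step size 1/L, where L = ||A||_2^2 is the Lipschitz constant of the
    # gradient of the smooth least-squares term.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                          # gradient of the smooth term
        x = soft_threshold(x - step * grad, step * lam)   # proximal step on the L1 term
    return x
```

In the paper's setting the iterate would presumably be a feature representation produced by ViT rather than a generic vector `x`, but that mapping is an assumption, not something the visible text confirms.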
