Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials

Yifan Pu; Jixuan Ying; Qixiu Li; Tianzhu Ye; Dongchen Han; Xiaochen Wang; Ziyi Wang; Xinyu Shao; Gao Huang; Xiu Li

arXiv:2511.00833·cs.CV·November 4, 2025

Open Access

TL;DR

This paper introduces Visual-Contrast Attention (VCA), a drop-in replacement for multi-head self-attention in Vision Transformers. It adds an explicit notion of discrimination, lowers the theoretical complexity of attention, and improves both recognition accuracy and image generation quality without extra FLOPs.

Contribution

VCA provides a simple, architecture-agnostic replacement for MHSA that injects explicit discrimination and reduces complexity, leading to better performance in recognition and generation tasks.

Findings

1. VCA improves DeiT-Tiny accuracy on ImageNet-1K by 3.4 percentage points.

2. VCA reduces FID scores in image generation models by up to 5.2 points.

3. Extensive ablations confirm the effectiveness of spatial pooling and dual positional embeddings.

Abstract

Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce Visual-Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from O(N N C) to O(N n C) with n << N. VCA first distils each head's dense query field into a handful of spatially pooled visual-contrast tokens, then splits them into a learnable positive and negative stream whose differential interaction highlights what truly separates one region from another. The module adds fewer than 0.3M parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly…
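
The mechanism described in the abstract can be illustrated with a minimal single-head NumPy sketch. It is one plausible reading of the abstract, not the authors' implementation. The function names, the average-pooling choice, the additive `pos_emb`/`neg_emb` embeddings, the fixed weight `lam`, and the second stage in which full-resolution queries attend to the contrast tokens are all assumptions made for illustration. The helper `attention_cost` only counts the N·M·C multiply-adds of one query-key product, to show where the O(N N C) versus O(N n C) gap comes from.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_cost(num_queries, num_keys, channels):
    # Multiply-adds of one query-key product (hypothetical helper).
    return num_queries * num_keys * channels


def pooled_differential_attention(q, k, v, grid, pool, pos_emb, neg_emb, lam=0.5):
    """Hypothetical sketch of a pooled, differential attention head.

    q, k, v : (N, C) arrays for one head, tokens in row-major grid order
    grid    : (H, W) with H * W == N
    pool    : pooling window; must divide both H and W
    pos_emb, neg_emb : (n, C) embeddings for the two streams, n = (H//pool)*(W//pool)
    lam     : weight of the negative stream (assumed fixed here)
    """
    H, W = grid
    N, C = q.shape
    h, w = H // pool, W // pool
    scale = C ** -0.5

    # Distil the dense query field into n = h * w spatially pooled tokens.
    q_pooled = q.reshape(h, pool, w, pool, C).mean(axis=(1, 3)).reshape(h * w, C)

    # Split into a positive and a negative stream (assumed: additive embeddings).
    q_pos = q_pooled + pos_emb
    q_neg = q_pooled + neg_emb

    # Stage 1: each stream attends to all N keys; cost is O(n N C).
    a_pos = softmax(q_pos @ k.T * scale)
    a_neg = softmax(q_neg @ k.T * scale)
    contrast = (a_pos - lam * a_neg) @ v  # (n, C) differential summary

    # Stage 2 (assumed): full-resolution queries read from the n contrast
    # tokens, so the cost stays O(N n C) instead of O(N N C).
    a = softmax(q @ contrast.T * scale)
    return a @ contrast  # (N, C)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, C, pool = 196, 64, 2  # 14 x 14 token grid, as in a 224-pixel ViT with 16-pixel patches
    q, k, v = (rng.normal(size=(N, C)) for _ in range(3))
    n = (14 // pool) ** 2
    out = pooled_differential_attention(
        q, k, v, (14, 14), pool,
        pos_emb=rng.normal(size=(n, C)),
        neg_emb=rng.normal(size=(n, C)),
    )
    print(out.shape)
    print(attention_cost(N, N, C) / attention_cost(N, n, C))
```

With these illustrative sizes (N = 196, n = 49), the pairwise product costs N/n = 4 times as much as the pooled one. The actual token counts and stream parameterization used in the paper are not stated in the abstract.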

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications