Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials
Yifan Pu, Jixuan Ying, Qixiu Li, Tianzhu Ye, Dongchen Han, Xiaochen Wang, Ziyi Wang, Xinyu Shao, Gao Huang, Xiu Li

TL;DR
This paper introduces Visual-Contrast Attention (VCA), a drop-in replacement for Multi-Head Self-Attention in Vision Transformers. VCA adds an explicit contrast signal, lowers attention complexity from O(N N C) to O(N n C) with n << N, and improves recognition accuracy and image generation quality without extra FLOPs.
Contribution
VCA provides a simple, architecture-agnostic replacement for MHSA that injects explicit discrimination and reduces complexity, leading to better performance in recognition and generation tasks.
Findings
VCA improves DeiT-Tiny accuracy on ImageNet-1K by 3.4 percentage points.
VCA reduces FID scores in image generation models by up to 5.2 points.
Extensive ablations confirm the effectiveness of spatial pooling and dual positional embeddings.
Abstract
Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce Visual-Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from O(N N C) to O(N n C) with n << N. VCA first distils each head's dense query field into a handful of spatially pooled visual-contrast tokens, then splits them into a learnable positive and negative stream whose differential interaction highlights what truly separates one region from another. The module adds fewer than 0.3M parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly…
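The mechanism described in the abstract can be illustrated with a rough NumPy sketch for a single head. This is not the authors' implementation. The abstract only states that queries are spatially pooled into a few contrast tokens, that these are split into a positive and a negative stream, and that the two streams interact differentially. Everything else here is an assumption made for illustration: average pooling, the two-stage layout, the additive embeddings `pos_embed` and `neg_embed`, the fixed weight `lam`, and all function names.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool_tokens(x, grid, pooled):
    # Average-pool a (grid*grid, C) token map down to (pooled*pooled, C).
    # Assumes grid is divisible by pooled and tokens are in row-major order.
    c = x.shape[-1]
    k = grid // pooled
    x = x.reshape(pooled, k, pooled, k, c)
    return x.mean(axis=(1, 3)).reshape(pooled * pooled, c)

def contrast_attention_sketch(q, k, v, grid, pooled, pos_embed, neg_embed, lam=0.5):
    # q, k, v: (N, C) for one head, with N = grid * grid.
    scale = q.shape[-1] ** -0.5
    t = avg_pool_tokens(q, grid, pooled)        # (n, C) contrast tokens
    t_pos = t + pos_embed                       # positive stream (assumed additive)
    t_neg = t + neg_embed                       # negative stream (assumed additive)
    # Stage 1 (assumed): n contrast tokens attend over all N keys.
    a_pos = softmax(t_pos @ k.T * scale)        # (n, N)
    a_neg = softmax(t_neg @ k.T * scale)        # (n, N)
    summary = (a_pos - lam * a_neg) @ v         # (n, C) differential summary
    # Stage 2 (assumed): each of the N queries attends over only n tokens.
    b = softmax(q @ t.T * scale)                # (N, n)
    return b @ summary                          # (N, C)

def pairwise_cost(num_queries, num_keys, channels):
    # Multiply-adds for one query-key score matrix.
    return num_queries * num_keys * channels

# Hypothetical sizes: a 14x14 token grid (N = 196), 64 channels per head,
# pooled to a 7x7 grid (n = 49).
rng = np.random.default_rng(0)
grid, pooled, c = 14, 7, 64
n_tokens, n_contrast = grid * grid, pooled * pooled
q, k, v = (rng.standard_normal((n_tokens, c)) for _ in range(3))
pos_embed = 0.02 * rng.standard_normal((n_contrast, c))
neg_embed = 0.02 * rng.standard_normal((n_contrast, c))

out = contrast_attention_sketch(q, k, v, grid, pooled, pos_embed, neg_embed)
dense = pairwise_cost(n_tokens, n_tokens, c)     # N * N * C
reduced = pairwise_cost(n_tokens, n_contrast, c) # N * n * C
```

With these sizes no score matrix is larger than N x n or n x N, so the cost of each one drops by a factor of N / n relative to the dense N x N case. The sketch computes three such matrices, so the saving is in the order of growth, not an exact FLOP count for the paper's module.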
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications
