PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers
Ryan Grainger, Thomas Paniagua, Xi Song, Naresh Cuntoor, Mun Wai Lee, Tianfu Wu

TL;DR
This paper introduces PaCa-ViT, a novel vision transformer that learns patch-to-cluster attention, reducing complexity and improving interpretability and performance across multiple vision tasks.
Contribution
It proposes a patch-to-cluster attention mechanism that replaces patch-to-patch attention, enabling linear complexity and better interpretability in Vision Transformers.
Findings
Outperforms prior models on ImageNet-1k, MS-COCO, and MIT-ADE20k benchmarks.
Relaxes the quadratic complexity of patch-to-patch attention to linear complexity.
Produces semantically meaningful learned clusters.
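The change from quadratic to linear complexity can be made concrete with a rough operation count. The symbols here are assumptions that do not appear in the summary: N is the number of patches, d the embedding dimension, and M the predefined number of clusters.

```latex
\text{patch-to-patch:}\quad QK^\top \in \mathbb{R}^{N \times N},\ \text{cost } O(N^2 d)
\qquad
\text{patch-to-cluster:}\quad QK^\top \in \mathbb{R}^{N \times M},\ \text{cost } O(N M d),\ M \text{ fixed}
```

As an illustrative case (the numbers are hypothetical, not taken from the paper), N = 196 patches and M = 49 clusters give 196 × 196 = 38,416 attention scores for patch-to-patch attention against 196 × 49 = 9,604 for patch-to-cluster attention, a 4× reduction that grows with N because M stays fixed.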
Abstract
Vision Transformers (ViTs) are built on the assumption of treating image patches as "visual tokens" and learn patch-to-patch attention. The patch-embedding-based tokenizer has a semantic gap with respect to its counterpart, the textual tokenizer. Patch-to-patch attention suffers from quadratic complexity and also makes it non-trivial to explain learned ViTs. To address these issues, this paper proposes to learn Patch-to-Cluster attention (PaCa) in ViT. Queries in PaCa-ViT start with patches, while keys and values are directly based on clustering (with a predefined small number of clusters). The clusters are learned end-to-end, leading to better tokenizers and inducing joint clustering-for-attention and attention-for-clustering for better and more interpretable models. The quadratic complexity is relaxed to linear complexity. The proposed PaCa module is used in…
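The mechanism described in the abstract (queries from patches, keys and values from a small set of learned clusters) can be sketched as follows. This is a minimal single-head NumPy illustration, not the authors' implementation. The function name `patch_to_cluster_attention`, the linear cluster-assignment projection `W_c`, the choice to normalize assignments over patches, and the omission of multi-head splitting and normalization layers are all assumptions made for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def patch_to_cluster_attention(X, W_c, W_q, W_k, W_v):
    """Hypothetical single-head sketch.

    X:   (N, d) patch embeddings
    W_c: (d, M) cluster-assignment projection (assumed to be a plain linear map)
    W_q, W_k, W_v: (d, d) query/key/value projections
    """
    N, d = X.shape
    # Soft assignment of patches to M clusters; each cluster's weights
    # sum to 1 over the N patches (an assumption of this sketch).
    C = softmax(X @ W_c, axis=0)              # (N, M)
    # Cluster tokens: weighted averages of patch embeddings.
    Z = C.T @ X                               # (M, d)
    Q = X @ W_q                               # (N, d): queries come from patches
    K = Z @ W_k                               # (M, d): keys come from clusters
    V = Z @ W_v                               # (M, d): values come from clusters
    # Attention matrix is N x M rather than N x N.
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)
    return A @ V, A, C

# Illustrative sizes only (not from the paper).
rng = np.random.default_rng(0)
N, d, M = 196, 64, 8
X = rng.standard_normal((N, d))
W_c = rng.standard_normal((d, M)) / np.sqrt(d)
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out, A, C = patch_to_cluster_attention(X, W_c, W_q, W_k, W_v)
print(out.shape, A.shape, C.shape)
```

In this sketch the assignment matrix `C` is what one would inspect for interpretability: each column shows how strongly every patch contributes to one cluster.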
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Label Smoothing · Dropout
