PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers

Ryan Grainger; Thomas Paniagua; Xi Song; Naresh Cuntoor; Mun Wai Lee; Tianfu Wu

arXiv:2203.11987 · cs.CV · April 10, 2023

Open Access · 1 Repo

TL;DR

This paper introduces PaCa-ViT, a vision transformer that learns patch-to-cluster attention in place of patch-to-patch attention, reducing attention complexity from quadratic to linear while improving interpretability and performance across multiple vision tasks.

Contribution

It proposes a patch-to-cluster attention mechanism that replaces patch-to-patch attention, enabling linear complexity and better interpretability in Vision Transformers.
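
The complexity claim can be made concrete with a rough operation count. This is a sketch under assumed notation that does not appear in the source: N is the number of patches, M the predefined number of clusters, and d the embedding dimension. It counts only the attention score computation and leaves out the cost of the clustering step itself, which this summary does not specify.

```latex
% Patch-to-patch attention: queries, keys and values all come from the N patches.
Q, K, V \in \mathbb{R}^{N \times d}
\quad\Longrightarrow\quad
\operatorname{cost}\!\left(Q K^{\top}\right) = \mathcal{O}(N^{2} d)

% Patch-to-cluster attention: queries come from patches, keys and values from M clusters.
Q \in \mathbb{R}^{N \times d},\quad K, V \in \mathbb{R}^{M \times d}
\quad\Longrightarrow\quad
\operatorname{cost}\!\left(Q K^{\top}\right) = \mathcal{O}(N M d)
```

With M fixed and small relative to N, the second cost grows linearly in the number of patches.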

Findings

01. Outperforms prior models on ImageNet-1k, MS-COCO, and MIT-ADE20k benchmarks.

02. Achieves significant efficiency gains with linear complexity.

03. Produces semantically meaningful learned clusters.

Abstract

Vision Transformers (ViTs) are built on the assumption of treating image patches as "visual tokens" and learn patch-to-patch attention. The patch-embedding-based tokenizer has a semantic gap with respect to its counterpart, the textual tokenizer. Patch-to-patch attention suffers from quadratic complexity and also makes it non-trivial to explain learned ViTs. To address these issues, this paper proposes to learn Patch-to-Cluster attention (PaCa) in ViT. Queries in PaCa-ViT start with patches, while keys and values are directly based on clustering (with a predefined small number of clusters). The clusters are learned end-to-end, leading to better tokenizers and inducing joint clustering-for-attention and attention-for-clustering for better and interpretable models. The quadratic complexity is relaxed to linear complexity. The proposed PaCa module is used in…
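
The core idea in the abstract, queries from patches and keys and values from a small set of learned clusters, can be illustrated with a minimal single-head NumPy sketch. It is not the authors' implementation: the abstract does not describe the clustering module, so the linear soft assignment `W_c`, the function name `patch_to_cluster_attention`, and all shapes and random weights below are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def patch_to_cluster_attention(X, W_c, W_q, W_k, W_v):
    """Single-head sketch. X: (N, d) patch tokens; W_c: (d, M) assignment weights.

    W_c is a hypothetical linear clustering layer, assumed for illustration.
    """
    # Soft-assign patches to M clusters, normalising over patches (axis 0) so
    # that each cluster token is a convex combination of patch tokens.
    A = softmax(X @ W_c, axis=0)                  # (N, M)
    Z = A.T @ X                                   # (M, d) cluster tokens

    Q = X @ W_q                                   # (N, d) queries from patches
    K = Z @ W_k                                   # (M, d) keys from clusters
    V = Z @ W_v                                   # (M, d) values from clusters

    # The score matrix is N x M rather than N x N.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])       # (N, M)
    attn = softmax(scores, axis=-1)               # each patch attends over clusters
    return attn @ V, attn, A


rng = np.random.default_rng(0)
N, d, M = 196, 32, 8                              # e.g. 14 x 14 patches, 8 clusters
X = rng.standard_normal((N, d))
W_c = rng.standard_normal((d, M)) * 0.1
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

out, attn, A = patch_to_cluster_attention(X, W_c, W_q, W_k, W_v)
print(out.shape, attn.shape)                      # (196, 32) (196, 8)
```

In this sketch the columns of `A` indicate which patches contribute to each cluster, which is one way such an assignment map could be inspected for interpretability.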

Code & Models

Repositories

ivmcl/pacavit (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Label Smoothing · Dropout