Group Generalized Mean Pooling for Vision Transformer

Byungsoo Ko; Han-Gyu Kim; Byeongho Heo; Sangdoo Yun; Sanghyuk Chun; Geonmo Gu; Wonjae Kim

arXiv:2212.04114·cs.CV·December 9, 2022·5 cites

TL;DR

This paper introduces Group Generalized Mean (GGeM) pooling for Vision Transformers, which improves feature aggregation by considering channel groups, leading to performance boosts across classification, retrieval, and multi-modal tasks.

Contribution

The paper proposes GGeM pooling, a pooling strategy for ViT that divides the channels into groups and applies GeM pooling with one shared pooling parameter per group, and reports that it outperforms existing pooling methods.

Findings

01 GGeM improves classification accuracy on ImageNet-1K by 0.1 to 0.7 percentage points.

02 GGeM outperforms existing pooling strategies in image retrieval tasks.

03 GGeM is simple to implement with minimal code modifications.

Abstract

Vision Transformer (ViT) extracts the final representation from either class token or an average of all patch tokens, following the architecture of Transformer in Natural Language Processing (NLP) or Convolutional Neural Networks (CNNs) in computer vision. However, studies for the best way of aggregating the patch tokens are still limited to average pooling, while widely-used pooling strategies, such as max and GeM pooling, can be considered. Despite their effectiveness, the existing pooling strategies do not consider the architecture of ViT and the channel-wise difference in the activation maps, aggregating the crucial and trivial channels with the same importance. In this paper, we present Group Generalized Mean (GGeM) pooling as a simple yet powerful pooling strategy for ViT. GGeM divides the channels into groups and computes GeM pooling with a shared pooling parameter per group. As…
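The grouping step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `ggem_pool`, the contiguous split of channels into groups, and the clamping of activations to a small positive `eps` (so that fractional powers are defined) are all assumptions made for illustration. In a real model the per-group exponents `p` would be learnable parameters; here they are plain arrays.

```python
import numpy as np


def ggem_pool(tokens, p, eps=1e-6):
    """Group-wise generalized mean pooling over patch tokens (illustrative sketch).

    tokens: array of shape (N, C), N patch tokens with C channels.
    p:      array of shape (G,), one pooling exponent per channel group.
    Returns an array of shape (C,).
    """
    n, c = tokens.shape
    g = p.shape[0]
    if c % g != 0:
        raise ValueError("number of channels must be divisible by number of groups")

    # Assumption: clamp to a small positive value so x ** p is well defined.
    x = np.clip(tokens, eps, None)

    # Assumption: groups are contiguous blocks of C // G channels.
    x = x.reshape(n, g, c // g)

    # GeM per group: (mean over tokens of x ** p_g) ** (1 / p_g),
    # with the same exponent p_g shared by every channel in group g.
    powered = x ** p.reshape(1, g, 1)
    pooled = powered.mean(axis=0) ** (1.0 / p.reshape(g, 1))

    return pooled.reshape(c)


# Hypothetical usage: 2 tokens, 4 channels, 2 groups with different exponents.
example_tokens = np.array([[1.0, 2.0, 3.0, 4.0],
                           [3.0, 4.0, 5.0, 6.0]])
example_out = ggem_pool(example_tokens, np.array([1.0, 2.0]))
```

With an exponent of 1 a group reduces to average pooling (for positive activations), and as the exponent grows the result moves toward max pooling, which is why a separate exponent per group lets different channel groups be aggregated differently.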

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Adam · Softmax · Layer Normalization · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Linear Layer