Group Generalized Mean Pooling for Vision Transformer
Byungsoo Ko, Han-Gyu Kim, Byeongho Heo, Sangdoo Yun, Sanghyuk Chun, Geonmo Gu, Wonjae Kim

TL;DR
This paper introduces Group Generalized Mean (GGeM) pooling for Vision Transformers, which improves feature aggregation by considering channel groups, leading to performance boosts across classification, retrieval, and multi-modal tasks.
Contribution
The paper proposes GGeM pooling, a novel channel grouping strategy for ViT that enhances feature aggregation and outperforms existing pooling methods.
Findings
GGeM improves ImageNet-1K classification accuracy by 0.1 to 0.7 percentage points.
GGeM outperforms existing pooling strategies in image retrieval tasks.
GGeM is simple to implement with minimal code modifications.
Abstract
Vision Transformer (ViT) extracts the final representation from either class token or an average of all patch tokens, following the architecture of Transformer in Natural Language Processing (NLP) or Convolutional Neural Networks (CNNs) in computer vision. However, studies for the best way of aggregating the patch tokens are still limited to average pooling, while widely-used pooling strategies, such as max and GeM pooling, can be considered. Despite their effectiveness, the existing pooling strategies do not consider the architecture of ViT and the channel-wise difference in the activation maps, aggregating the crucial and trivial channels with the same importance. In this paper, we present Group Generalized Mean (GGeM) pooling as a simple yet powerful pooling strategy for ViT. GGeM divides the channels into groups and computes GeM pooling with a shared pooling parameter per group. As…
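The grouping idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `gem_pool` and `ggem_pool`, the `eps` clamp (borrowed from common GeM implementations to keep activations positive), and the use of fixed rather than learned pooling parameters are all assumptions made for illustration. Patch tokens are represented as a list of per-token channel lists, channels are split into equal contiguous groups, and every channel in a group is pooled with that group's exponent `p`.

```python
def gem_pool(values, p, eps=1e-6):
    """Generalized mean of one channel's activations over all tokens.

    p = 1 gives average pooling; large p approaches max pooling.
    Clamping to eps is an assumption, not taken from the paper.
    """
    clamped = [max(v, eps) for v in values]
    return (sum(v ** p for v in clamped) / len(clamped)) ** (1.0 / p)


def ggem_pool(tokens, group_powers, eps=1e-6):
    """Group GeM sketch: one pooling parameter shared per channel group.

    tokens:       list of N patch tokens, each a list of C channel values
    group_powers: list of G exponents, one per group (C must be divisible by G)
    Returns a list of C pooled channel values.
    """
    num_channels = len(tokens[0])
    num_groups = len(group_powers)
    if num_channels % num_groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    group_size = num_channels // num_groups

    pooled = []
    for c in range(num_channels):
        p = group_powers[c // group_size]        # exponent shared within the group
        column = [token[c] for token in tokens]  # channel c across all tokens
        pooled.append(gem_pool(column, p, eps))
    return pooled


if __name__ == "__main__":
    # Two tokens, four channels, two groups of two channels each.
    tokens = [[1.0, 2.0, 1.0, 4.0],
              [3.0, 2.0, 1.0, 2.0]]
    print(ggem_pool(tokens, group_powers=[1.0, 64.0]))
```

With `group_powers=[1.0, 64.0]`, the first group behaves like average pooling and the second like a soft max, which is one way to see how different groups can weight strong activations differently. In a trainable model the exponents would presumably be learnable parameters rather than constants.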
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Adam · Softmax · Layer Normalization · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Linear Layer
