MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation
Maike Behrendt, Stefan Sylvius Wagner, Stefan Harmeling

TL;DR
MaxPoolBERT introduces lightweight layer- and token-wise aggregation methods that improve BERT's classification accuracy, especially in low-resource settings, without new pre-training or a significant increase in model size.
Contribution
It proposes novel aggregation techniques—max-pooling and multi-head attention over tokens and layers—to enhance BERT's representations for classification tasks.
Findings
MaxPoolBERT outperforms standard BERT on GLUE low-resource tasks.
The aggregation methods improve classification accuracy without requiring new pre-training.
Model size does not increase significantly.
Abstract
The [CLS] token in BERT is commonly used as a fixed-length representation for classification tasks, yet prior work has shown that both other tokens and intermediate layers encode valuable contextual information. In this work, we study lightweight extensions to BERT that refine the [CLS] representation by aggregating information across layers and tokens. Specifically, we explore three modifications: (i) max-pooling the [CLS] token across multiple layers, (ii) enabling the [CLS] token to attend over the entire final layer using an additional multi-head attention (MHA) layer, and (iii) combining max-pooling across the full sequence with MHA. Our approach, called MaxPoolBERT, enhances BERT's classification accuracy (especially on low-resource tasks) without requiring new pre-training or significantly increasing model size. Experiments on the GLUE benchmark show that MaxPoolBERT consistently…
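The three modifications listed in the abstract can be sketched with toy arrays. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `max_pool_cls` and `cls_attention`, the weight matrices, and the toy shapes are all hypothetical, the [CLS] token is assumed to sit at position 0, and biases, dropout, residual connections, and layer normalization are omitted. The combined variant (iii) is shown under one possible reading of the abstract: max-pool every token position over the last k layers, then let [CLS] attend over the pooled sequence.

```python
import numpy as np

def max_pool_cls(hidden_states, k):
    """Element-wise max of the [CLS] vector over the last k layers.

    hidden_states: array of shape (num_layers, seq_len, hidden).
    Assumes [CLS] is at position 0 in every layer.
    """
    cls_per_layer = hidden_states[-k:, 0, :]   # (k, hidden)
    return cls_per_layer.max(axis=0)           # (hidden,)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def cls_attention(sequence, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head attention with [CLS] (position 0) as the only query.

    sequence: array of shape (seq_len, hidden).
    Returns the refined [CLS] vector and the per-head attention weights.
    """
    seq_len, hidden = sequence.shape
    head_dim = hidden // num_heads

    def split_heads(x):
        # (n, hidden) -> (num_heads, n, head_dim)
        return x.reshape(x.shape[0], num_heads, head_dim).transpose(1, 0, 2)

    q = split_heads(sequence[:1] @ w_q)        # (heads, 1, head_dim)
    k = split_heads(sequence @ w_k)            # (heads, seq_len, head_dim)
    v = split_heads(sequence @ w_v)            # (heads, seq_len, head_dim)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, 1, seq_len)
    weights = softmax(scores, axis=-1)
    context = weights @ v                                   # (heads, 1, head_dim)
    merged = context.transpose(1, 0, 2).reshape(1, hidden)  # concatenate heads
    return (merged @ w_o)[0], weights[:, 0, :]

# Toy stand-in for encoder outputs; real values would come from a BERT model.
rng = np.random.default_rng(0)
num_layers, seq_len, hidden, heads = 13, 8, 16, 4
hidden_states = rng.normal(size=(num_layers, seq_len, hidden))
w_q, w_k, w_v, w_o = (rng.normal(scale=0.1, size=(hidden, hidden)) for _ in range(4))

# (i) max-pool [CLS] over the last k = 3 layers
cls_vec = max_pool_cls(hidden_states, k=3)

# (iii) assumed reading: max-pool all positions over the last 3 layers,
# then apply (ii), [CLS] attending over the whole pooled sequence
pooled_seq = hidden_states[-3:].max(axis=0)    # (seq_len, hidden)
refined, attn = cls_attention(pooled_seq, w_q, w_k, w_v, w_o, heads)
```

In this sketch, `cls_vec` or `refined` would replace the plain final-layer [CLS] vector as input to the classification head. Passing `hidden_states[-1]` instead of `pooled_seq` to `cls_attention` gives variant (ii) alone.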
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. The paper is straightforward, though it could benefit from more in-depth analysis.
1. Several small evaluation sets, like RTE, show variance that is higher than the performance gain reported in this paper. 2. This paper does not compare against other papers' improvements over BERT, only against vanilla BERT. 3. The BERT baseline results reported in this paper are lower than those reported by others. 4. In the final layer, the [CLS] token already attends to all tokens and passes information from earlier layers through the residual connection, so why do we need additional sett
- The overall idea is simple and straightforward. - The paper is easy to follow. The proposed method is explained in detail and is easy to reproduce.
- The results are only evaluated on BERT and RoBERTa, with testing conducted exclusively on the GLUE benchmark. - The idea of enhancing the [CLS] token has been extensively studied in the field of Vision Transformers (ViTs) and their variants. However, due to the variable input lengths of NLP tasks, different strategies may exhibit performance variations across distinct tasks. It would be valuable to further evaluate the proposed method on additional benchmarks. - I believe that conducting exper
1. Good Clarity and Motivation. The paper is easy to read and the motivation is very clear. Why we need to improve the [CLS] token representation is well explained with support from other papers. 2. Focus on Low-Resource and Stability. For the BERT-base model, the results are strong. On some small datasets like MRPC and RTE, the improvement is quite big, which shows the method has some real effect in low-resource situations. They also show that their model has a smaller standard deviation across
1. Failure to Generalize. This is the most serious weakness. The method improves BERT-base but slightly hurts RoBERTa-base. This strongly suggests that it is not a general method for improving similar models. It is more like a "patch" that fixes a specific weakness in the original BERT-base model. The reason why it fails on RoBERTa should be investigated. 2. Limited Novelty. Max-pooling and attention are standard tools. The paper's contribution is more of an incremental engineering improvement by
The paper is well-structured and easy to follow; the proposed modification is clearly described and easy to understand.
The empirical results are not strong enough to convince me that the proposed modification is useful. On 5 out of 9 datasets, the performance gap between the best-performing variant and vanilla BERT is less than 0.5. There is no clear winner among the proposed variants, making it unclear which one to use. Additionally, the method requires extra hyperparameter tuning (e.g., selecting the last $k$ layers).
Taxonomy
Topics: Web Data Mining and Analysis · Advanced Neural Network Applications · Machine Learning and Data Classification
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Softmax · Attention Dropout · WordPiece · Linear Layer · Residual Connection · Weight Decay · Dropout
