A Study on Transformer Configuration and Training Objective
Fuzhao Xue, Jianghai Chen, Aixin Sun, Xiaozhe Ren, Zangwei Zheng, Xiaoxin He, Yongming Chen, Xin Jiang, Yang You

TL;DR
This paper revisits conventional transformer configurations and training objectives, showing that deeper, narrower models trained with a masked autoencoder objective improve performance on vision and language tasks.
Contribution
It proposes Bamboo, the idea of using deeper and narrower transformer configurations for masked autoencoder training, and supports it with theoretical analysis and experiments.
Findings
Deeper, narrower transformers trained with a masked autoencoder objective outperform conventional configurations.
The re-designed model achieves 87.1% top-1 accuracy on ImageNet.
The model outperforms BERT by 1.1 points on GLUE datasets.
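To make "deeper and narrower" concrete, the following is a minimal sketch that estimates the weight count of an encoder stack from its width and depth. It counts only the attention projections and the feed-forward layer, and assumes a feed-forward expansion ratio of 4. The function name `encoder_params` and the 512-wide, 27-layer configuration are hypothetical illustrations, not settings reported for Bamboo.

```python
def encoder_params(width: int, depth: int, ffn_ratio: int = 4) -> int:
    """Approximate weight count of a transformer encoder stack.

    Per layer, counts the four attention projections (Q, K, V, output:
    4 * width^2) and the two feed-forward matrices (2 * ffn_ratio * width^2).
    Biases, layer norms, and embeddings are ignored (an assumption made
    to keep the estimate simple).
    """
    per_layer = 4 * width * width + 2 * ffn_ratio * width * width
    return depth * per_layer


# Conventional base configuration: width 768, depth 12.
conventional = encoder_params(width=768, depth=12)

# Hypothetical deeper, narrower configuration (NOT taken from the paper),
# chosen so that the estimated weight count stays the same.
deep_narrow = encoder_params(width=512, depth=27)

print(conventional, deep_narrow)
```

Under this approximation the weight count scales with depth times width squared, so narrowing the model by a factor of 1.5 leaves room for 2.25 times as many layers at the same budget.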
Abstract
Transformer-based models have delivered impressive results on many tasks, particularly vision and language tasks. In many model training situations, conventional configurations are typically adopted. For example, we often set the base model with hidden dimensions (i.e. model width) to be 768 and the number of transformer layers (i.e. model depth) to be 12. In this paper, we revisit these conventional configurations. Through theoretical analysis and experimental evaluation, we show that the masked autoencoder is effective in alleviating the over-smoothing issue in deep transformer training. Based on this finding, we propose Bamboo, an idea of using deeper and narrower transformer configurations, for masked autoencoder training. On ImageNet, with such a simple change in configuration, the re-designed model achieves 87.1% top-1 accuracy and outperforms SoTA models like MAE and BEiT. On…
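Over-smoothing refers to token representations becoming increasingly similar to one another as depth grows. One common proxy for it is the mean pairwise cosine similarity between the token vectors of a layer. The sketch below illustrates that proxy on toy vectors. The source does not say which measure the paper uses, so the function name `mean_pairwise_cosine` and the toy data are assumptions for illustration only.

```python
import math


def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all distinct pairs of token vectors."""

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    n = len(tokens)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(tokens[i], tokens[j]) for i, j in pairs) / len(pairs)


# Toy token representations (hypothetical, not model outputs).
diverse = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]      # tokens point in different directions
smoothed = [[1.0, 0.1], [1.0, 0.0], [0.9, 0.05]]     # tokens have nearly collapsed together

print(mean_pairwise_cosine(diverse))
print(mean_pairwise_cosine(smoothed))
```

A value close to 1 means the tokens of a layer are nearly indistinguishable, which is the symptom the abstract says masked autoencoder training alleviates in deep transformers.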
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Masked autoencoder · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Transformer · Weight Decay
