A Study on Transformer Configuration and Training Objective

Fuzhao Xue; Jianghai Chen; Aixin Sun; Xiaozhe Ren; Zangwei Zheng; Xiaoxin He; Yongming Chen; Xin Jiang; Yang You

arXiv:2205.10505 · cs.LG · May 19, 2023 · 1 citation

TL;DR

This paper investigates transformer configurations and training objectives, demonstrating that deeper, narrower models with masked autoencoders improve performance on vision and language tasks.

Contribution

It introduces Bamboo, a novel transformer configuration using deeper and narrower structures, and shows its effectiveness through theoretical analysis and experiments.

Findings

01. Deeper, narrower transformers with masked autoencoders outperform conventional models.

02. The re-designed model achieves 87.1% top-1 accuracy on ImageNet.

03. The model outperforms BERT by 1.1 points on GLUE datasets.

Abstract

Transformer-based models have delivered impressive results on many tasks, particularly vision and language tasks. In many model training situations, conventional configurations are typically adopted. For example, the base model is often set with a hidden dimension (i.e., model width) of 768 and 12 transformer layers (i.e., model depth). In this paper, we revisit these conventional configurations. Through theoretical analysis and experimental evaluation, we show that the masked autoencoder is effective in alleviating the over-smoothing issue in deep transformer training. Based on this finding, we propose Bamboo, an idea of using deeper and narrower transformer configurations, for masked autoencoder training. On ImageNet, with such a simple change in configuration, the re-designed model achieves 87.1% top-1 accuracy and outperforms SoTA models like MAE and BEiT. On…
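
To make the width/depth trade-off in the abstract concrete, here is a minimal sketch that counts the parameters of a standard transformer encoder stack for a given depth and width. It assumes a feed-forward expansion ratio of 4, biases on every linear layer, and two layer norms per block, and it ignores embeddings and task heads. The function names and the 48-layer, 384-wide configuration are hypothetical illustrations chosen to roughly match the parameter budget of the conventional 12-layer, 768-wide base model; they are not the Bamboo configurations reported in the paper.

```python
def layer_params(width: int, ffn_ratio: int = 4) -> int:
    """Parameters in one standard transformer encoder block (assumed layout)."""
    # Self-attention: Q, K, V and output projections, each width x width plus bias.
    attention = 4 * (width * width + width)
    # Feed-forward: width -> ffn_ratio*width -> width, with biases.
    hidden = ffn_ratio * width
    feed_forward = (width * hidden + hidden) + (hidden * width + width)
    # Two layer norms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * width)
    return attention + feed_forward + layer_norms


def encoder_params(depth: int, width: int) -> int:
    """Parameters in a stack of `depth` identical encoder blocks."""
    return depth * layer_params(width)


if __name__ == "__main__":
    conventional = encoder_params(depth=12, width=768)
    # Hypothetical deeper, narrower configuration with a similar budget.
    deep_narrow = encoder_params(depth=48, width=384)
    print(f"12 layers x 768 wide: {conventional:,} parameters")
    print(f"48 layers x 384 wide: {deep_narrow:,} parameters")
```

Because the per-block count grows roughly with the square of the width, halving the width allows about four times as many layers at a similar parameter count, which is the kind of reallocation a deeper and narrower design relies on.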

Videos

A Study on Transformer Configuration and Training Objective · SlidesLive

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Masked autoencoder · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Transformer · Weight Decay