Layerwise Importance Analysis of Feed-Forward Networks in Transformer-based Language Models
Wataru Ikeda, Kazuki Yano, Ryosuke Takahashi, Jaesung Lee, Keigo Shibata, Jun Suzuki

TL;DR
This paper examines the importance of feed-forward networks (FFNs) in Transformer language models during pretraining by reallocating FFN capacity across layers at a fixed total parameter count, and finds that concentrating FFNs in the middle layers improves downstream task performance.
Contribution
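The paper introduces an experimental method that holds the total parameter count fixed while enlarging the FFNs in some layers and removing them entirely from others. Each model is trained from scratch, so layerwise FFN importance is measured during pretraining rather than on an existing pretrained model.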
Findings
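Concentrating FFNs in 70% of the consecutive middle layers consistently outperforms the standard configuration on multiple downstream tasks.
The result holds across model sizes (285M, 570M, and 1.2B parameters) and layer counts (12, 24, and 40 layers).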
Layerwise importance of FFNs varies with position and size.
Abstract
This study investigates the layerwise importance of feed-forward networks (FFNs) in Transformer-based language models during pretraining. We introduce an experimental approach that, while maintaining the total parameter count, increases the FFN dimensions in some layers and completely removes the FFNs from other layers. Furthermore, since our focus is on the importance of FFNs during pretraining, we train models from scratch to examine whether the importance of FFNs varies depending on their layer positions, rather than using publicly available pretrained models as is frequently done. Through comprehensive evaluations of models with varying sizes (285M, 570M, and 1.2B parameters) and layer counts (12, 24, and 40 layers), we demonstrate that concentrating FFNs in 70% of the consecutive middle layers consistently outperforms standard configurations for multiple downstream tasks.
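The reallocation described in the abstract can be illustrated with a little arithmetic. The sketch below is not the authors' code: the function names, the rounding rule for picking the middle block, and the example width d_ff = 3072 are all assumptions made for illustration. It relies only on the fact that, for a fixed model width, the parameter count of an FFN grows linearly with its hidden dimension, so removing FFNs from some layers frees a budget that can widen the FFNs in the remaining layers.

```python
def middle_layer_indices(num_layers: int, ratio: float) -> list[int]:
    """Return indices of a consecutive middle block of layers that keep an FFN.

    Assumption: the block size is round(num_layers * ratio) and the block is
    centered, with any odd leftover layer placed at the end of the stack.
    """
    num_kept = round(num_layers * ratio)
    start = (num_layers - num_kept) // 2
    return list(range(start, start + num_kept))


def reallocated_ffn_dim(num_layers: int, d_ff: int, num_kept: int) -> int:
    """Hidden size for the kept FFNs so the total FFN budget stays about constant.

    With a fixed model width, FFN parameters per layer scale linearly with
    d_ff, so the budget of num_layers * d_ff is spread over num_kept layers.
    Integer division means the budget is matched only approximately.
    """
    return (num_layers * d_ff) // num_kept


if __name__ == "__main__":
    # Hypothetical baseline: 12 layers, each with an FFN of hidden size 3072.
    kept = middle_layer_indices(num_layers=12, ratio=0.7)
    new_dim = reallocated_ffn_dim(num_layers=12, d_ff=3072, num_kept=len(kept))
    print("layers with FFN:", kept)
    print("widened FFN hidden size:", new_dim)
```

In this hypothetical 12-layer setup, the first two and last two layers lose their FFNs and the eight middle layers each get an FFN 1.5 times wider than the baseline. The paper's exact layer selection and dimension rounding may differ.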
