Layerwise Importance Analysis of Feed-Forward Networks in Transformer-based Language Models

Wataru Ikeda; Kazuki Yano; Ryosuke Takahashi; Jaesung Lee; Keigo Shibata; Jun Suzuki

arXiv:2508.17734·cs.CL·August 26, 2025

TL;DR

This paper examines how important feed-forward networks (FFNs) are at each layer of a Transformer language model during pretraining. It reallocates FFN capacity across layers while keeping the total parameter count fixed, and finds that concentrating FFNs in the middle layers improves downstream task performance.

Contribution

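It introduces an experimental method for analyzing FFN importance: FFN parameters are reallocated across layers (widened in some layers, removed entirely from others) while the total parameter count stays fixed, and each configuration is trained from scratch to assess layerwise importance during pretraining.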

Findings

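1. Concentrating FFNs in 70% of the consecutive middle layers improves downstream task performance.

2. Models with FFNs concentrated in this way consistently outperform standard configurations across the tested sizes (285M, 570M, and 1.2B parameters) and depths (12, 24, and 40 layers).

3. Layerwise importance of FFNs varies with position and size.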

Abstract

This study investigates the layerwise importance of feed-forward networks (FFNs) in Transformer-based language models during pretraining. We introduce an experimental approach that, while maintaining the total parameter count, increases the FFN dimensions in some layers and completely removes the FFNs from other layers. Furthermore, since our focus is on the importance of FFNs during pretraining, we train models from scratch to examine whether the importance of FFNs varies depending on their layer positions, rather than using publicly available pretrained models as is frequently done. Through comprehensive evaluations of models with varying sizes (285M, 570M, and 1.2B parameters) and layer counts (12, 24, and 40 layers), we demonstrate that concentrating FFNs in 70% of the consecutive middle layers consistently outperforms standard configurations for multiple downstream tasks.
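The reallocation described in the abstract can be illustrated with a small sketch. It is not the authors' code: it assumes that FFN parameters per layer scale linearly with the FFN hidden size (fixed model width, biases ignored), so that holding the sum of hidden sizes constant holds the FFN parameter count roughly constant. The function name `middle_layer_ffn_dims`, the rounding rule for the number of kept layers, the centering rule, and the base hidden size used in the example are all hypothetical choices made for illustration.

```python
from typing import List


def middle_layer_ffn_dims(
    num_layers: int, base_d_ff: int, keep_fraction: float = 0.7
) -> List[int]:
    """Per-layer FFN hidden sizes with FFNs kept only in a consecutive middle block.

    Assumption (not taken from the paper): FFN parameters per layer are
    proportional to d_ff, so preserving sum(d_ff) preserves the FFN budget.
    A hidden size of 0 means the layer has no FFN at all.
    """
    if num_layers < 1 or base_d_ff < 1:
        raise ValueError("num_layers and base_d_ff must be positive")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Hypothetical rounding rule: nearest integer, at least one layer kept.
    num_kept = max(1, round(num_layers * keep_fraction))
    # Center the kept block; any odd leftover layer goes to the end.
    start = (num_layers - num_kept) // 2

    # Spread the original FFN budget over the kept layers (floor division,
    # so the total never exceeds the original budget).
    total_budget = num_layers * base_d_ff
    widened_d_ff = total_budget // num_kept

    dims = [0] * num_layers
    for layer in range(start, start + num_kept):
        dims[layer] = widened_d_ff
    return dims


if __name__ == "__main__":
    # 12 layers with an illustrative base FFN hidden size of 2048.
    print(middle_layer_ffn_dims(num_layers=12, base_d_ff=2048))
```

Under these assumptions, a 12-layer model keeps FFNs in 8 layers (indices 2 through 9, zero-based), each widened from 2048 to 3072, while the first two and last two layers have no FFN. The exact layer selection and dimension rounding used in the paper may differ.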
