Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers

Xiuying Wei; Skander Moalla; Razvan Pascanu; Caglar Gulcehre

arXiv:2406.16450 · cs.CL · November 7, 2024

TL;DR

This paper investigates structured low-rank and block-diagonal feedforward layers in large language models, demonstrating computational efficiency gains and proposing a novel self-guided training method to improve training dynamics and scaling performance.

Contribution

The paper trains structured FFNs from scratch in transformer-based LLMs at scales up to 1.3B parameters, and proposes self-guided training to improve their training stability and efficiency.

Findings

01. Structured FFNs enable computational gains in LLMs.

02. Self-guided training improves the training dynamics of structured FFNs.

03. Structured models can achieve lower loss with fewer parameters at optimal trade-offs.

Abstract

State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive. This has sparked a research agenda to reduce these models' parameter counts and computational costs without significantly impacting their performance. Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFNs), which are less studied than attention blocks. We consider three structured linear parameterizations of the FFN using efficient low-rank and block-diagonal matrices. In contrast to many previous works that examined these approximations, our study i) explores these structures from a training-from-scratch perspective, ii) scales up to 1.3B parameters, and iii) is conducted within recent Transformer-based LLMs rather than convolutional architectures. We demonstrate that these structures can lead to…
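To make the two matrix families named in the abstract concrete, here is a minimal NumPy sketch of a low-rank and a block-diagonal linear map, with their parameter counts compared to a dense layer. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy dimensions, and the `(num_blocks, d_out // num_blocks, d_in // num_blocks)` block layout are all hypothetical, and the sketch does not reproduce the paper's three specific parameterizations or its self-guided training.

```python
import numpy as np


def low_rank_linear(x, U, V):
    """Apply W = U @ V without forming W. U: (d_out, r), V: (r, d_in)."""
    # Project down to rank r first, then back up to d_out.
    return (x @ V.T) @ U.T


def block_diag_linear(x, blocks):
    """Apply a block-diagonal W. blocks: (b, d_out // b, d_in // b)."""
    b, block_out, block_in = blocks.shape
    n = x.shape[0]
    # Split the input features into b contiguous groups.
    x_blocks = x.reshape(n, b, block_in)
    # Each group is multiplied only by its own block.
    y_blocks = np.einsum("nbi,boi->nbo", x_blocks, blocks)
    return y_blocks.reshape(n, b * block_out)


def param_counts(d_in, d_out, rank, num_blocks):
    """Weight counts (biases ignored) for the three layer types."""
    return {
        "dense": d_in * d_out,
        "low_rank": rank * (d_in + d_out),
        "block_diag": (d_in * d_out) // num_blocks,
    }


# Toy sizes chosen for illustration only (assumption, not from the paper).
rng = np.random.default_rng(0)
d_in, d_out, rank, num_blocks = 8, 32, 4, 4
x = rng.standard_normal((5, d_in))
U = rng.standard_normal((d_out, rank))
V = rng.standard_normal((rank, d_in))
blocks = rng.standard_normal(
    (num_blocks, d_out // num_blocks, d_in // num_blocks)
)

y_low_rank = low_rank_linear(x, U, V)
y_block = block_diag_linear(x, blocks)
counts = param_counts(d_in, d_out, rank, num_blocks)
print(counts)
```

With these toy sizes the dense layer has 256 weights, the low-rank layer 160, and the block-diagonal layer 64. Multiply-accumulate cost per input row shrinks by the same ratios, which is the kind of saving the abstract refers to.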

Code & Models

Repositories

claire-labo/structuredffn (PyTorch, official)

Taxonomy

Topics: Digital Rights Management and Security · Artificial Intelligence in Law

Methods: Attention Is All You Need · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer · Absolute Position Encodings