Transformer Layer Injection: A Novel Approach for Efficient Upscaling of Large Language Models

James Vo

arXiv:2410.11654·cs.CL·October 16, 2024

TL;DR

Transformer Layer Injection (TLI) upscales large language models by injecting a new layer into every set of K existing layers, which improves initialization and reduces the training needed afterward. The authors report that it outperforms existing techniques in accuracy and cost-effectiveness.

Contribution

The paper introduces Transformer Layer Injection (TLI), a novel layer injection technique that enhances model scaling efficiency and performance with minimal disruption to existing transformer architectures.

Findings

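01. TLI outperforms DUS, MoE, and other methods in the reported experiments.

02. Models upscaled with TLI require fewer training steps and achieve higher accuracy than the compared methods.

03. The authors position TLI as scalable from 10B to 405B parameters, although the experiments described in the abstract use 1B, 3B, and 8B models.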

Abstract

In this paper, we propose Transformer Layer Injection (TLI), a novel method for efficiently upscaling large language models (LLMs) while minimizing computational costs and maintaining model performance. Model scale is a key factor in enhancing the quality of machine learning models, and TLI addresses the challenge of scaling by reducing initial loss, minimizing fine-tuning requirements, and preserving model complexity. Our approach improves upon the conventional Depth Up-Scaling (DUS) technique by injecting new layers into every set of K layers, enabling hidden representations to pass through transformer blocks with minimal disruption. We compare TLI with existing approaches, including Mixture of Experts (MoE) and DUS, and validate its efficiency through experiments on small LLMs (Llama 3 1B, 3B, and 8B). Results show that TLI achieves better initialization, requires fewer training…
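
The phrase "injecting new layers into every set of K layers" can be illustrated with a small layer-ordering sketch. This is not the paper's implementation: the function name `tli_layer_plan` is hypothetical, and the sketch assumes that each injected layer starts as a copy of the layer immediately before it, which the abstract does not state. It only shows where new layers would sit in the upscaled stack.

```python
from typing import List, Tuple


def tli_layer_plan(num_layers: int, k: int) -> List[Tuple[int, bool]]:
    """Return (source_layer_index, is_injected) for each layer of the upscaled stack.

    Hypothetical helper for illustration only.
    """
    if num_layers <= 0 or k <= 0:
        raise ValueError("num_layers and k must be positive")
    plan: List[Tuple[int, bool]] = []
    for i in range(num_layers):
        # Keep the original layer in its original position.
        plan.append((i, False))
        # After every complete group of k original layers, inject one new layer.
        # ASSUMPTION: the new layer is initialized from layer i (a copy);
        # the abstract does not specify the initialization.
        if (i + 1) % k == 0:
            plan.append((i, True))
    return plan


# Example: an 8-layer model with one injected layer per group of 4.
plan = tli_layer_plan(num_layers=8, k=4)
for position, (source, injected) in enumerate(plan):
    tag = "injected" if injected else "original"
    print(f"position {position}: from layer {source} ({tag})")
```

Under these assumptions, a model with N layers grows to N + floor(N / K) layers, and the original layers keep their relative order, which is one way to read "minimal disruption" to the hidden representations.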

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Dense Connections · Residual Connection · Dropout · Layer Normalization · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Attention Is All You Need · Linear Layer