Tuning Language Models by Mixture-of-Depths Ensemble

Haoyan Luo; Lucia Specia

arXiv:2410.13077 · cs.CL · October 18, 2024

TL;DR

This paper introduces Mixture-of-Depths (MoD), a novel framework that leverages intermediate layers of transformer-based language models through learned ensemble routing, improving performance and efficiency in language modeling tasks.

Contribution

The paper proposes MoD, a tuning method that treats intermediate layers as an ensemble combined through learned routing, improving language model tuning while reducing the number of trainable parameters needed.

Findings

1. MoD improves performance across various language modeling tasks.

2. Using intermediate layers as ensembles can match or surpass final-layer tuning.

3. MoD achieves similar results with fewer trainable parameters.

Abstract

Transformer-based Large Language Models (LLMs) traditionally rely on final-layer loss for training and final-layer representations for predictions, potentially overlooking the predictive power embedded in intermediate layers. Surprisingly, we find that focusing training efforts on these intermediate layers can yield training losses comparable to those of final layers, with complementary test-time performance. We introduce a novel tuning framework, Mixture-of-Depths (MoD), which trains late layers as ensembles contributing to the final logits through learned routing weights. With the auxiliary distillation loss and additional normalization modules, we ensure that the outputs of the late layers adapt to language modeling. Our MoD framework, which can be integrated with any existing tuning method, shows consistent improvement on various language modelling tasks. Furthermore, by replacing…
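
The idea of late layers "contributing to the final logits through learned routing weights" can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one scalar routing score per layer, normalized with a softmax, and it omits the distillation loss and normalization modules the abstract mentions. The names `mix_layer_logits`, `layer_logits`, and `routing_scores` are hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layer_logits(layer_logits, routing_scores):
    """Combine per-layer logits into one logit vector.

    layer_logits: K logit vectors (one per late layer), each of length V.
    routing_scores: K unnormalized scalars. ASSUMPTION: the paper's routing
    may be parameterized differently; a softmax over scalar scores is used
    here only to keep the sketch small.
    """
    if len(layer_logits) != len(routing_scores):
        raise ValueError("need exactly one routing score per layer")
    weights = softmax(routing_scores)
    vocab_size = len(layer_logits[0])
    mixed = [0.0] * vocab_size
    for weight, logits in zip(weights, layer_logits):
        for i, value in enumerate(logits):
            mixed[i] += weight * value
    return mixed, weights


# Toy example: three late layers, vocabulary of four tokens (made-up numbers).
layer_logits = [
    [2.0, 0.5, -1.0, 0.0],
    [1.5, 1.0, -0.5, 0.2],
    [3.0, 0.0, -2.0, 0.1],
]
routing_scores = [0.0, 0.0, 0.0]  # equal scores give uniform weights

mixed, weights = mix_layer_logits(layer_logits, routing_scores)
next_token_probs = softmax(mixed)
```

With equal routing scores the mixture reduces to a plain average of the per-layer logits. During tuning the scores would be trained jointly with the model, so layers that predict better receive larger weights.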

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques