Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models

Keivan Alizadeh; Iman Mirzadeh; Hooman Shahrokhi; Dmitry Belenko; Frank Sun; Minsik Cho; Mohammad Hossein Sekhavat; Moin Nabi; Mehrdad Farajtabar

arXiv:2410.10846·cs.LG·October 16, 2024

Open Access

TL;DR

This paper introduces Duo-LLM, a framework that enables adaptive computation in large language models by dynamically routing tokens through smaller or larger modules based on their complexity, improving efficiency and understanding of internal routing processes.

Contribution

It proposes a novel framework with auxiliary modules for dynamic token routing in LLMs, providing insights into optimal patterns and the gap between practical routing and theoretical optima.

Findings

01. Activating a large module in one layer outperforms using it across all layers.

02. Trained routers differ from oracle solutions, often being suboptimal.

03. The framework reveals the internal routing dynamics and potential for efficiency gains.
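The comparison between trained routers and oracle solutions can be pictured with a toy search. The sketch below is an illustration under stated assumptions, not the authors' procedure: it assumes an oracle can be approximated by exhaustively enumerating small/big choices per layer under a cap on how many big modules may be active. The names `oracle_pattern` and `toy_loss`, and the per-layer gain values, are hypothetical.

```python
from itertools import product


def oracle_pattern(loss_fn, num_layers, max_big):
    """Exhaustively search small/big patterns with at most `max_big` big modules.

    Assumption: brute-force enumeration stands in for an oracle here; it is
    only feasible for a handful of layers (2**num_layers candidates).
    """
    best_pattern, best_loss = None, float("inf")
    for pattern in product(("small", "big"), repeat=num_layers):
        if pattern.count("big") > max_big:
            continue  # pattern exceeds the compute budget
        loss = loss_fn(pattern)
        if loss < best_loss:
            best_pattern, best_loss = pattern, loss
    return best_pattern, best_loss


def toy_loss(pattern):
    """Hypothetical loss: each layer has a made-up gain from using its big module."""
    gains = (0.1, 0.5, 0.2, 0.05)  # illustrative numbers, not from the paper
    return 1.0 - sum(g for g, choice in zip(gains, pattern) if choice == "big")


if __name__ == "__main__":
    print(oracle_pattern(toy_loss, num_layers=4, max_big=1))
```

A trained router could then be scored against the pattern this search returns for the same budget, which is one way to quantify how far a learned routing decision sits from the best available one.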

Abstract

Large Language Models (LLMs) typically generate outputs token by token using a fixed compute budget, leading to inefficient resource utilization. To address this shortcoming, recent advancements in mixture-of-experts (MoE) models, speculative decoding, and early-exit strategies leverage the insight that computational demands can vary significantly based on the complexity and nature of the input. However, identifying optimal routing patterns for dynamic execution remains an open challenge, limiting the full potential of these adaptive methods. To address this need, we study adaptive computation in LLMs more systematically. We propose a novel framework that integrates smaller auxiliary modules within each Feed-Forward Network layer of the LLM. This design enables dynamic routing of tokens based on task complexity: tokens can be processed by either the small or big modules at each layer, or…
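The design described in the abstract, a small auxiliary module placed alongside the big feed-forward module in each layer, can be sketched roughly as follows. This is a minimal pure-Python illustration, not the paper's implementation: the residual connection, the ReLU activation, the random weights, and every name (`make_ffn`, `duo_layer`, `forward`, `cost`) are assumptions. It covers only the small/big choice stated above; whatever the truncated sentence goes on to describe is not modelled.

```python
import random


def make_ffn(d_model, d_hidden, rng):
    """Return random weights for a toy two-layer feed-forward module."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    return w_in, w_out


def ffn(x, weights):
    """Apply x -> relu(x @ w_in) @ w_out to a single token vector."""
    w_in, w_out = weights
    hidden = [max(0.0, sum(xi * w_in[i][j] for i, xi in enumerate(x)))
              for j in range(len(w_out))]
    return [sum(h * w_out[j][k] for j, h in enumerate(hidden))
            for k in range(len(x))]


def duo_layer(x, small, big, choice):
    """One layer: route the token through the small or the big module.

    Assumption: the module output is added back to the input (residual).
    """
    update = ffn(x, small if choice == "small" else big)
    return [xi + ui for xi, ui in zip(x, update)]


def forward(x, layers, pattern):
    """Run a token through all layers, following a per-layer routing pattern."""
    for (small, big), choice in zip(layers, pattern):
        x = duo_layer(x, small, big, choice)
    return x


def cost(pattern, d_small, d_big):
    """Hidden units used along the pattern, as a crude proxy for compute."""
    width = {"small": d_small, "big": d_big}
    return sum(width[choice] for choice in pattern)


if __name__ == "__main__":
    rng = random.Random(0)
    layers = [(make_ffn(4, 2, rng), make_ffn(4, 8, rng)) for _ in range(3)]
    token = [1.0, 0.0, -1.0, 0.5]
    print(forward(token, layers, ["small", "big", "small"]))
    print(cost(["small", "big", "small"], d_small=2, d_big=8))
```

In this sketch the routing pattern is passed in by hand; in a trained system a router would choose it per token, which is where the question of optimal routing patterns arises.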

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Mixture of Experts