Adaptive Large Language Models By Layerwise Attention Shortcuts

Prateek Verma; Mert Pilanci

arXiv:2409.10870·cs.CL·December 24, 2024

TL;DR

This paper introduces adaptive layerwise attention shortcuts in transformer-based large language models, enabling the final layer to selectively attend to intermediate layers, which improves performance across diverse datasets.

Contribution

The paper proposes an adaptive computation method in which the final layer attends to all intermediate layers through the attention mechanism, making transformer architectures adaptive in both depth and context.

Findings

01 Superior performance on multiple datasets

02 Models learn complex, adaptive dependencies across layers

03 Attention maps show context-dependent layer interactions

Abstract

Transformer architectures are the backbone of the modern AI revolution. However, they are built by simply stacking the same block dozens of times and processing information sequentially from one block to the next. In this paper, we challenge this design and introduce adaptive computation for LLM-like setups: the final layer can attend to all of the intermediate layers as it deems fit through the attention mechanism, thereby introducing computational attention shortcuts. These shortcuts make the architecture adaptive in both depth and context. We showcase four different datasets spanning acoustic tokens, natural language, and symbolic music, and achieve superior performance for a GPT-like architecture. Attention maps give evidence that the models learn complex dependencies across layers that adapt in context and depth depending on the input…
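The shortcut idea in the abstract can be sketched in a few lines of NumPy. This is only one possible reading, not the authors' implementation: it assumes that, at each token position, a query built from the last layer's output attends over that position's outputs from every layer, using single-head scaled dot-product attention. The function name layerwise_attention_shortcut, the projection matrices w_q, w_k, w_v, and all shapes are hypothetical and chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layerwise_attention_shortcut(layer_outputs, w_q, w_k, w_v):
    """Hypothetical sketch: the final layer attends over all layers' outputs.

    layer_outputs: array of shape (num_layers, seq_len, d_model), one slice
                   per transformer block (assumed layout, not from the paper).
    Returns the mixed representation (seq_len, d_v) and the per-position
    attention weights over layers (seq_len, num_layers).
    """
    h = np.asarray(layer_outputs)
    q = h[-1] @ w_q  # queries from the last layer: (seq_len, d_k)
    k = h @ w_k      # keys from every layer: (num_layers, seq_len, d_k)
    v = h @ w_v      # values from every layer: (num_layers, seq_len, d_v)
    # For each position t, score the query against each layer's key at t.
    scores = np.einsum("td,ltd->tl", q, k) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # normalized over the layer axis
    mixed = np.einsum("tl,ltd->td", weights, v)
    return mixed, weights


# Toy usage with random stand-ins for the per-layer activations.
rng = np.random.default_rng(0)
num_layers, seq_len, d_model = 4, 5, 8
layer_outputs = rng.normal(size=(num_layers, seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
mixed, weights = layerwise_attention_shortcut(layer_outputs, w_q, w_k, w_v)
```

Each row of weights is a distribution over layers for one token position, so inspecting it is a rough analogue of the attention maps the abstract mentions: different positions can place their mass on different depths.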

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need