Leveraging the true depth of LLMs

Ramón Calvo González; Daniele Paliotta; Matteo Pagliardini; Martin Jaggi; François Fleuret

arXiv:2502.02790·cs.LG·January 7, 2026

Open Access

TL;DR

This paper restructures an LLM's computational graph so that consecutive layer pairs are evaluated in parallel. Without retraining, this yields a 1.19x throughput gain on Llama 2 7B while reducing average benchmark accuracy by only 1.5%, which makes large-scale deployment more efficient.

Contribution

The authors introduce a graph restructuring technique that enables parallel evaluation of layer pairs in LLMs without retraining, improving inference speed.

Findings

01. 1.19x throughput gain on Llama 2 7B, with no retraining

02. Average benchmark accuracy drops by only 1.5%

03. Lightweight fine-tuning of the parallelized layers recovers some of the lost accuracy

Abstract

The remarkable capabilities of Large Language Models (LLMs) are overshadowed by their immense computational cost. While recent work has shown that many LLM layers can be reordered or even removed with minimal impact on accuracy, these insights have not been translated into significant inference speedups. To bridge this gap, we introduce a novel method that restructures the computational graph by grouping and evaluating consecutive layer pairs in parallel. This approach, requiring no retraining, yields a 1.19x throughput gain on Llama 2 7B while reducing the average benchmark accuracy by only 1.5%. We demonstrate the practical value of this method for large-scale LLM deployment and show that some of the lost accuracy can be recovered with lightweight fine-tuning of the parallelized layers.
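
To make the idea of evaluating a consecutive layer pair in parallel concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: each layer is modeled as a toy residual branch, and the names make_block, sequential_pair, and parallel_pair are hypothetical. In the sequential form the second layer reads the output of the first; in the parallel form both layers read the same input, so the two branches no longer depend on each other and can in principle be computed concurrently.

```python
import numpy as np


def make_block(rng, dim, scale=0.05):
    """Toy residual branch f(x) = tanh(x @ W1) @ W2.

    Hypothetical stand-in for a transformer layer's attention + MLP;
    the small weight scale keeps each branch a small update to x.
    """
    w1 = rng.normal(0.0, scale, size=(dim, dim))
    w2 = rng.normal(0.0, scale, size=(dim, dim))
    return lambda x: np.tanh(x @ w1) @ w2


def sequential_pair(x, f_a, f_b):
    """Standard stacking: the second layer sees the first layer's output."""
    h = x + f_a(x)
    return h + f_b(h)


def parallel_pair(x, f_a, f_b):
    """Assumed parallel form: both layers read the same input x.

    f_a(x) and f_b(x) are independent, so they could run at the same time.
    """
    return x + f_a(x) + f_b(x)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))          # 4 token vectors of width 16
f_a = make_block(rng, 16)
f_b = make_block(rng, 16)

seq = sequential_pair(x, f_a, f_b)
par = parallel_pair(x, f_a, f_b)

# Relative gap between the two outputs; small when branch updates are small.
rel_gap = np.linalg.norm(seq - par) / np.linalg.norm(seq)
print(f"relative gap between sequential and parallel outputs: {rel_gap:.4f}")
```

The two forms differ only in the input to the second branch, f_b(x + f_a(x)) versus f_b(x). When each branch contributes a small update to the residual stream, that difference is small, which is one intuition for why an approximation of this kind might cost little accuracy. The toy weights here say nothing about the behavior of a real model.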

Taxonomy

Topics: Semantic Web and Ontologies