SHARP: Accelerating Language Model Inference by SHaring Adjacent layers with Recovery Parameters

Yiping Wang; Hanxian Huang; Yifang Chen; Jishen Zhao; Simon Shaolei Du; Yuandong Tian

arXiv:2502.07832·cs.LG·February 13, 2025

TL;DR

SHARP is a method that shares parameters between adjacent layers of large language models to reduce memory and computation costs, while recovery parameters and fine-tuning maintain model performance on resource-limited devices.

Contribution

The paper introduces SHARP, a novel layer-sharing approach with recovery parameters, enabling efficient LLM inference with minimal performance loss on mobile devices.

Findings

01. Reduces model storage by up to 65%

02. Cuts inference time by 42.2% on mobile devices

03. Maintains perplexity with limited fine-tuning data

Abstract

While large language models (LLMs) have advanced natural language processing tasks, their growing computational and memory demands make deployment on resource-constrained devices such as mobile phones increasingly challenging. In this paper, we propose SHARP (SHaring Adjacent Layers with Recovery Parameters), a novel approach that accelerates LLM inference by sharing parameters across adjacent layers, thus reducing memory load overhead, while introducing low-rank recovery parameters to maintain performance. Inspired by observations that consecutive layers have similar outputs, SHARP employs a two-stage recovery process: Single Layer Warmup (SLW) and Supervised Fine-Tuning (SFT). The SLW stage aligns the outputs of the shared layers using an L_2 loss, providing a good initialization for the subsequent SFT stage, which further restores model performance. Extensive experiments demonstrate that SHARP…
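The idea of reusing one layer's weights for its neighbor and training a small low-rank correction against an L_2 loss can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a single linear map per "layer", an additive correction of the form W_shared + A·B (a LoRA-style choice made here for illustration), and plain full-batch gradient descent. All names (`recovered_forward`, `warmup`, `W_shared`, `W_target`) are hypothetical.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def recovered_forward(W_shared, A, B, x):
    """Reuse W_shared and add a low-rank correction: y = W_shared x + A (B x)."""
    base = matvec(W_shared, x)
    corr = matvec(A, matvec(B, x))
    return [b + c for b, c in zip(base, corr)]

def l2_loss(W_shared, A, B, W_target, inputs):
    """Mean squared L2 distance between the recovered and the original layer outputs."""
    total = 0.0
    for x in inputs:
        y = recovered_forward(W_shared, A, B, x)
        t = matvec(W_target, x)
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
    return total / len(inputs)

def warmup(W_shared, W_target, inputs, rank, lr=0.05, steps=300, seed=0):
    """Fit A (d x rank) and B (rank x d) by gradient descent on the L2 loss.

    Assumption: A starts at zero so the recovered layer initially equals the
    shared layer; B starts with small random values.
    """
    rng = random.Random(seed)
    d, n = len(W_shared), len(inputs)
    A = [[0.0] * rank for _ in range(d)]
    B = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(rank)]
    for _ in range(steps):
        grad_A = [[0.0] * rank for _ in range(d)]
        grad_B = [[0.0] * d for _ in range(rank)]
        for x in inputs:
            h = matvec(B, x)  # low-rank hidden vector, length = rank
            y = recovered_forward(W_shared, A, B, x)
            t = matvec(W_target, x)
            e = [yi - ti for yi, ti in zip(y, t)]  # output error
            At_e = [sum(A[i][k] * e[i] for i in range(d)) for k in range(rank)]
            for i in range(d):
                for k in range(rank):
                    grad_A[i][k] += 2.0 * e[i] * h[k] / n
            for k in range(rank):
                for j in range(d):
                    grad_B[k][j] += 2.0 * At_e[k] * x[j] / n
        for i in range(d):
            for k in range(rank):
                A[i][k] -= lr * grad_A[i][k]
        for k in range(rank):
            for j in range(d):
                B[k][j] -= lr * grad_B[k][j]
    return A, B

# Toy setup: the "next" layer differs from the shared one by a rank-1 perturbation.
rng = random.Random(1)
d, rank = 4, 2
W_shared = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
u = [rng.uniform(-0.5, 0.5) for _ in range(d)]
v = [rng.uniform(-0.5, 0.5) for _ in range(d)]
W_target = [[W_shared[i][j] + u[i] * v[j] for j in range(d)] for i in range(d)]
inputs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(16)]

A_zero = [[0.0] * rank for _ in range(d)]
B_zero = [[0.0] * d for _ in range(rank)]
loss_before = l2_loss(W_shared, A_zero, B_zero, W_target, inputs)
A, B = warmup(W_shared, W_target, inputs, rank)
loss_after = l2_loss(W_shared, A, B, W_target, inputs)
print(loss_before, loss_after)
```

In this toy setting only `W_shared`, `A`, and `B` would need to be stored for the second layer's role, which is where the memory saving would come from when `rank` is much smaller than `d`. The SFT stage described in the abstract is not modeled here.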

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis