TL;DR
This paper introduces Sliding-Window Merging, a dynamic compression technique that merges similar consecutive layers in large language models, reducing redundancy while maintaining performance and outperforming existing pruning methods.
Contribution
The paper presents a novel layer merging approach based on functional similarity, effectively simplifying LLMs while preserving their inference capabilities.
Findings
Outperforms existing pruning techniques in zero-shot inference.
Achieves a 1.654% performance improvement over existing pruning methods at 35% pruning on Vicuna-7B.
Demonstrates the potential of combining depth and width pruning.
Abstract
Depth-wise pruning accelerates LLM inference in resource-constrained scenarios but suffers from performance degradation because it directly removes entire Transformer layers. This paper reveals "patch-like" redundancy across layers via correlation analysis of the outputs of different layers in a reproducing kernel Hilbert space, demonstrating that consecutive layers exhibit high functional similarity. Building on this observation, this paper proposes Sliding-Window Merging (SWM), a dynamic compression method that selects consecutive layers from top to bottom using a pre-defined similarity threshold and compacts patch-redundant layers through parameter consolidation, thereby simplifying the model structure while maintaining its performance. Extensive experiments on LLMs with various architectures and different parameter scales show that our method outperforms existing pruning techniques in…
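The top-to-bottom selection and merging described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not specify the similarity measure used for the threshold or the consolidation rule, so the sketch assumes cosine similarity between layer outputs and element-wise averaging of parameters as stand-ins. The names `cosine_similarity`, `merge_window`, and `sliding_window_merge`, and the flat-list layer representation, are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def merge_window(window):
    """Consolidate a window of layers into a single layer.

    ASSUMPTION: element-wise averaging is a placeholder; the paper's
    parameter consolidation rule may differ.
    """
    return {
        name: [
            sum(values) / len(window)
            for values in zip(*(layer[name] for layer in window))
        ]
        for name in window[0]
    }


def sliding_window_merge(layers, outputs, threshold):
    """Merge runs of similar consecutive layers, scanning top to bottom.

    layers[i]  : dict mapping parameter name -> flat list of floats
                 (index 0 is the bottom layer).
    outputs[i] : flat list of floats, the output of layers[i] on some
                 calibration input (ASSUMPTION: precomputed by the caller).
    """
    merged = []
    hi = len(layers) - 1  # top of the current window
    while hi >= 0:
        lo = hi
        # Grow the window downward while the next lower layer's output
        # stays similar enough to the output at the top of the window.
        while lo > 0 and cosine_similarity(outputs[lo - 1], outputs[hi]) >= threshold:
            lo -= 1
        merged.append(merge_window(layers[lo:hi + 1]))
        hi = lo - 1  # start the next window just below the merged one
    merged.reverse()  # restore bottom-to-top order
    return merged


# Toy usage: four layers, where only the top two produce similar outputs.
layers = [{"w": [float(i), float(i)]} for i in range(4)]
outputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 1.0]]
compact = sliding_window_merge(layers, outputs, threshold=0.9)
print(len(compact), [layer["w"] for layer in compact])
# prints: 3 [[0.0, 0.0], [1.0, 1.0], [2.5, 2.5]]
```

In this toy run only the top two layers exceed the threshold, so they collapse into one layer and the lower two are kept unchanged. A higher threshold merges fewer layers, and a lower one compresses more aggressively.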
Taxonomy
Methods: Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Residual Connection · Label Smoothing · Multi-Head Attention · Dense Connections · Adam
