GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching
Guinan Su, Li Shen, Lu Yin, Shiwei Liu, Yanwu Yang, Jonas Geiping

TL;DR
This paper introduces GPTailor, a novel method for compressing large language models by layer cutting and stitching, which maintains high performance while significantly reducing model size.
Contribution
The paper presents a new approach to LLM pruning that combines layer removal, selection, and merging from finetuned variants, optimizing model compression.
Findings
Maintains 97.3% of original performance on Llama2-13B after 25% parameter reduction.
Outperforms previous state-of-the-art pruning methods.
Uses a zero-order optimization framework for layer operations.
Abstract
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in deployment and inference. While structured pruning of model parameters offers a promising way to reduce computational costs at deployment time, current methods primarily focus on single model pruning. In this work, we develop a novel strategy to compress models by strategically combining or merging layers from finetuned model variants, which preserves the original model's abilities by aggregating capabilities accentuated in different finetunes. We pose the optimal tailoring of these LLMs as a zero-order optimization problem, adopting a search space that supports three different operations: (1) Layer removal, (2) Layer selection from different candidate models,…
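The per-layer search space described above (remove a layer, select it from one candidate, or merge it across candidates) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: layers are toy lists of floats, the names LayerChoice, build_model, sample_config, and random_search are hypothetical, merging is taken to be a normalized weighted average, and plain random search stands in for the zero-order optimizer (the reviews below mention SMAC with multi-fidelity, which this sketch does not reproduce).

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Layer = List[float]   # toy stand-in for one layer's weights (assumption)
Model = List[Layer]   # a model is an ordered list of layers


@dataclass(frozen=True)
class LayerChoice:
    """Hypothetical encoding of one per-layer decision."""
    op: str                        # "remove", "select", or "merge"
    source: int = 0                # candidate index, used by "select"
    weights: Sequence[float] = ()  # mixing weights, used by "merge"


def build_model(candidates: List[Model], config: List[LayerChoice]) -> Model:
    """Stitch a smaller model from candidates that share one architecture."""
    stitched: Model = []
    for depth, choice in enumerate(config):
        if choice.op == "remove":
            continue  # drop this depth entirely
        if choice.op == "select":
            stitched.append(list(candidates[choice.source][depth]))
        elif choice.op == "merge":
            # Assumed merge rule: normalized weighted average of the layer
            # at this depth across all candidates.
            layers = [model[depth] for model in candidates]
            total = sum(choice.weights)
            stitched.append([
                sum(w * layer[i] for w, layer in zip(choice.weights, layers)) / total
                for i in range(len(layers[0]))
            ])
        else:
            raise ValueError(f"unknown op: {choice.op}")
    return stitched


def sample_config(n_layers: int, n_candidates: int, n_remove: int,
                  rng: random.Random) -> List[LayerChoice]:
    """Draw one random configuration with exactly n_remove removed layers."""
    removed = set(rng.sample(range(n_layers), n_remove))
    config = []
    for depth in range(n_layers):
        if depth in removed:
            config.append(LayerChoice("remove"))
        elif rng.random() < 0.5:
            config.append(LayerChoice("select", source=rng.randrange(n_candidates)))
        else:
            mix = tuple(rng.uniform(0.1, 1.0) for _ in range(n_candidates))
            config.append(LayerChoice("merge", weights=mix))
    return config


def random_search(candidates: List[Model], score: Callable[[Model], float],
                  n_remove: int, trials: int,
                  seed: int = 0) -> Tuple[List[LayerChoice], float]:
    """Gradient-free search: only score(model) is queried, never a gradient."""
    rng = random.Random(seed)
    n_layers = len(candidates[0])
    best_config, best_score = None, float("-inf")
    for _ in range(trials):
        config = sample_config(n_layers, len(candidates), n_remove, rng)
        value = score(build_model(candidates, config))
        if value > best_score:
            best_config, best_score = config, value
    return best_config, best_score
```

In a realistic setting, score would evaluate the stitched model on held-out tasks; the search only ever needs that scalar, which is what makes the problem zero-order.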
Peer Reviews
Decision: ICLR 2026 Poster
1. The idea of this paper is both interesting and innovative. By compressing and merging task-specific fine-tuned models, it achieves impressive compression performance without requiring additional fine-tuning.
2. The experiments are solid, demonstrating the method’s effectiveness across four categories and fourteen datasets.
3. The paper is clearly written and easy to follow.
1. The theoretical analysis is weak. Although the experimental results are significant, the paper lacks a theoretical explanation or visual analysis (such as layer representation similarity) of why cross-model stitching outperforms single-model pruning.
2. Interpretability is insufficient. Although a structural diagram of the pruned model is provided, there is no in-depth analysis of why certain layers are retained or merged.
- Reframes pruning as cross-model layer selection/merging as opposed to single-model trimming, which is a solid approach.
- Flexible search space (drop / pick / merge per layer).
- Strong empirical results on Llama-2-7B/13B: ~25% of layers removed while retaining most of the performance, outperforming structured-pruning baselines by a significant margin given the same candidates.
- Useful ablations showing that merging across variants is one of the main sources of quality recovery, not j
- No latency/throughput inference comparison with LLM-Pruner, SliceGPT, LaCo, or ShortGPT, so the practical speedup of the 25% layer drop is unclear.
- The search itself is heavy (500 SMAC trials, multi-fidelity), which somewhat offsets the advantages and could be hard to reproduce for bigger pools of models.
- Assumes access to task-specialized variants of the same base model, which doesn't always match deployment settings.
1. Introduces a creative, structured pruning paradigm that assembles a smaller model by reusing layers from multiple fine-tuned variants.
2. Frames pruning as a meta-optimization problem, offering a search-based alternative to rule-based or heuristic layer removal.
3. The method is conceptually simple and potentially compatible with existing LLM fine-tuning workflows.
1. Since GPTailor builds its pruned model by mixing layers from several fine-tuned versions, it’s hard to tell whether the reported improvements actually come from the proposed search and stitching method, or simply from combining strong models together. The paper would be more convincing if it included comparisons with random or heuristic layer combinations to separate the effect of its algorithm from that of model diversity.
2. The approach assumes that we already have multiple fine-tuned ver
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Pruning · Focus
