FlattenGPT: Depth Compression for Transformer with Layer Flattening
Ruihan Xu, Qingpei Guo, Yao Zhu, Xiangyang Ji, Ming Yang, Shiliang Zhang

TL;DR
FlattenGPT introduces a novel depth compression method for transformers by flattening adjacent layers, effectively reducing model depth while preserving learned knowledge, leading to improved efficiency with minimal performance loss.
Contribution
The paper proposes FlattenGPT, a new approach that compresses transformer depth through layer flattening, enabling better redundancy detection and model acceleration without significant performance degradation.
Findings
Outperforms existing pruning methods in zero-shot accuracy and perplexity.
Retains 90-96% of performance at a 20% compression ratio on large models.
Enhances inference speed of large language models.
Abstract
Recent works have indicated redundancy across transformer blocks, prompting research on depth compression that prunes less crucial blocks. However, current entire-block pruning methods risk discarding meaningful cues learned in those blocks, leading to substantial performance degradation. As another line of model compression, channel pruning better preserves performance, but it cannot reduce model depth and is challenged by inconsistent pruning ratios across individual layers. To pursue better model compression and acceleration, this paper proposes FlattenGPT, a novel way to detect and reduce depth-wise redundancies. By flattening two adjacent blocks into one, it compresses the network depth while enabling more effective detection and removal of parameter redundancy. FlattenGPT preserves the knowledge learned in all blocks, and remains consistent with…
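The flattening idea in the abstract can be illustrated with a small sketch. It is not the paper's implementation: it covers only two residual MLP sub-blocks, omits normalization and attention, and uses hypothetical names (`mlp`, `flatten_mlps`). It assumes that the second block reads approximately the same input as the first; under that assumption the two blocks act in parallel, and concatenating their weights along the hidden dimension gives a single wider block with exactly the same output. The later pruning step that removes redundant parameters from the widened block is not shown.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def mlp(x, w_in, w_out):
    # x: (d,), w_in: (h, d), w_out: (d, h)
    return w_out @ gelu(w_in @ x)


def flatten_mlps(w_in_a, w_out_a, w_in_b, w_out_b):
    # Hypothetical helper: stack the hidden units of two MLPs into one wider MLP.
    w_in = np.concatenate([w_in_a, w_in_b], axis=0)     # (2h, d)
    w_out = np.concatenate([w_out_a, w_out_b], axis=1)  # (d, 2h)
    return w_in, w_out


rng = np.random.default_rng(0)
d, h = 8, 16
w_in_a, w_in_b = rng.normal(size=(h, d)), rng.normal(size=(h, d))
w_out_a, w_out_b = rng.normal(size=(d, h)), rng.normal(size=(d, h))
x = rng.normal(size=d)

# Parallel form: both blocks read the same input x (the stated assumption).
parallel = x + mlp(x, w_in_a, w_out_a) + mlp(x, w_in_b, w_out_b)

# Flattened form: one block with a doubled hidden width.
w_in, w_out = flatten_mlps(w_in_a, w_out_a, w_in_b, w_out_b)
flat = x + mlp(x, w_in, w_out)

# Original sequential form, for comparison: block b reads block a's output.
x1 = x + mlp(x, w_in_a, w_out_a)
sequential = x1 + mlp(x1, w_in_b, w_out_b)

print("parallel vs flat:", np.max(np.abs(parallel - flat)))
print("sequential vs flat:", np.max(np.abs(sequential - flat)))
```

The flattened block matches the parallel form exactly. Its gap to the sequential form depends on how much the first block changes the residual stream, which is why flattening is an approximation that suits adjacent blocks with similar inputs.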
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
(1) Maintains original block width and head count, ensuring easy deployment and LoRA compatibility. (2) Demonstrates consistent gains in throughput and latency without severe accuracy drops. (3) Builds on empirical findings on strong cross-layer redundancy in transformer residual paths.
(1) Relies on small calibration sets and greedy similarity-based layer pairing, which may be unstable. (2) Evaluation focuses on decoder-only LLMs; applicability to encoder or encoder-decoder architectures (like VLMs) is unverified. (3) Reported acceleration results may vary across GPU architectures, models, or inference backends. (4) Analytical justification of flattening equivalence is mostly empirical and heuristic.
The method provided in the manuscript seems sensible. The strength/innovation is mostly in the high level algorithm and layer merging strategies; the structured pruning strategy seems less novel/interesting (as the manuscript notes, other strategies could be used). In particular, the results seem good when compared with alternative depth compression strategies and competitive with pure structured pruning strategies (though in some places this comparison is a bit trickier particularly if more e
I think that the main weakness of the manuscript in its current form is the presentation. This manifests in two forms: (1) a significant number of grammatical errors that need to be addressed and (2) a lack of precision in various places that could lead to confusion and/or make it hard to interpret results. The first point is important to address but it's also clear what is needed; the second point is more significant. To illustrate: I do not really think Theorems 2.1 and 2.2 add/say much.
1. The results comparing this work to prior methods like SLEB, LaCO, BlockPruner, and ShortGPT show improved performance on zero-shot tasks and perplexity, while maintaining solid latency and throughput results. 2. The empirical comparison of the residual norm vs the MHA/MLP norm to motivate the paper is a nice contribution demonstrating a potential cause of layer redundancy. 3. The proposed method is straightforward, relatively simple, and easy to understand and implement. The basic method of
1. Part of the motivation of this work discusses channel pruning as being suboptimal because it may result in differently sized transformer layers, whereas FlattenGPT preserves the Transformer layer structure, but with fewer layers. It is not clear in what specific sense reducing the width/size of weights is strictly worse than removing full layers. While speed vs channel pruning is stated, I do not understand the hyperparameter tuning and/or model deployment issues with channel pruning. 2.
1. The concept of layer flattening, merging similar adjacent layers instead of deleting them, is both creative and intuitive. It fills an unexplored middle ground between existing pruning approaches. 2. The analysis of layer redundancy through variance and gradient norms provides strong motivation. 3. The experiments are broad, covering multiple LLM backbones, with consistent improvements in both speed and accuracy.
1. "However, these methods usually assign different pruning ratio for each layer". However, this is not accurate. Methods like Wanda and SparseGPT actually apply the same sparsity ratio across all layers. 2. The authors argue that FlattenGPT can "preserve the knowledge learned in all blocks", but in theory, channel pruning can also achieve this goal, since it retains important connections within each layer. 3. "which is caused by the residual path spanning the entire LLM. This similarity is part
Taxonomy
Topics: Image Enhancement Techniques · Video Coding and Compression Technologies · Advanced Neural Network Applications
