Modular Transformers: Compressing Transformers into Modularized Layers for Flexible Efficient Inference
Wangchunshu Zhou, Ronan Le Bras, Yejin Choi

TL;DR
Modular Transformers introduce a flexible, modular approach to compress pre-trained Transformer models, enabling adjustable performance-efficiency trade-offs through reassemblable layers after a single training phase.
Contribution
This paper presents a novel modularized framework for Transformer compression that allows dynamic adjustment of model size and performance post-training.
Findings
Achieves compression ratios from 1.1x to 6x with minimal performance loss.
Enables flexible model assembly for different efficiency needs.
Single training phase suffices for multiple compression levels.
Abstract
Pre-trained Transformer models like T5 and BART have advanced the state of the art on a wide range of text generation tasks. Compressing these models into smaller ones has become critically important for practical use. Common neural network compression techniques such as knowledge distillation or quantization are limited to static compression, where the compression ratio is fixed. In this paper, we introduce Modular Transformers, a modularized encoder-decoder framework for flexible sequence-to-sequence model compression. Modular Transformers train modularized layers that have the same function as two or more consecutive layers in the original model via module replacing and knowledge distillation. After training, the modularized layers can be flexibly assembled into sequence-to-sequence models that meet different performance-efficiency trade-offs. Experimental results show that after a…
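The assembly step described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the names `Block`, `assemble`, and `layer_ratio` are hypothetical, no neural network is involved, and the ratio of original layers to assembled blocks is only an assumed rough proxy for the compression ratio reported in the paper. The sketch only shows the bookkeeping: each modularized layer stands in for a run of consecutive original layers, and any left-to-right cover of the original stack is a valid assembly.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    """One unit in an assembled stack (hypothetical helper type)."""
    start: int  # index of the first original layer this block covers
    span: int   # 1 = original layer kept, >= 2 = modularized layer


def assemble(num_layers: int, spans: List[int]) -> List[Block]:
    """Cover original layers 0..num_layers-1, left to right, with blocks."""
    if any(s < 1 for s in spans):
        raise ValueError("every span must cover at least one layer")
    if sum(spans) != num_layers:
        raise ValueError("spans must cover the original stack exactly")
    blocks, start = [], 0
    for s in spans:
        blocks.append(Block(start=start, span=s))
        start += s
    return blocks


def layer_ratio(num_layers: int, blocks: List[Block]) -> float:
    """Original layer count divided by assembled block count (a rough proxy)."""
    return num_layers / len(blocks)


# Assumed 12-layer stack: the same trained modules yield different trade-offs.
light = assemble(12, [1] * 10 + [2])   # mostly original layers, one module
medium = assemble(12, [2] * 6)         # every block replaces two layers
heavy = assemble(12, [4, 4, 4])        # every block replaces four layers
```

Choosing a different `spans` list after training is what makes the compression adjustable: no retraining is implied by the sketch, only a different selection of already-trained blocks.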
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Gated Linear Unit · Attention Is All You Need · SentencePiece · Adafactor · Label Smoothing · Absolute Position Encodings · Linear Layer · Dropout · Attention Dropout · Position-Wise Feed-Forward Layer
