M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference
Nikhil Bhendawade, Mahyar Najibi, Devang Naik, Irina Belousova

TL;DR
M2R2 is a dynamic residual modulation framework for transformers that adjusts residual velocity based on token complexity, improving inference speed while maintaining quality and outperforming existing distance-based methods.
Contribution
The paper proposes M2R2, a novel method that dynamically modulates residual velocity in transformers, enhancing inference efficiency over static or distance-based approaches.
Findings
Achieves up to a 2.8x speedup on reasoning-oriented tasks such as MT-Bench.
Outperforms state-of-the-art distance-based residual strategies.
Reduces expert-switching and accelerates decoding in MoE architectures.
Abstract
Residual transformations enhance the representational depth and expressive power of large language models (LLMs). However, applying static residual transformations across all tokens in auto-regressive generation leads to a suboptimal trade-off between inference efficiency and generation fidelity. Existing methods, including Early Exiting, Skip Decoding, and Mixture-of-Depth, address this by modulating the residual transformation based on token-level complexity. Nevertheless, these approaches predominantly consider the distance traversed by tokens through the model layers, neglecting the underlying velocity of residual evolution. We introduce Mixture of Multi-rate Residuals (M2R2), a framework that dynamically modulates residual velocity to improve early alignment, enhancing inference efficiency. Evaluations on reasoning-oriented tasks such as Koala, Self-Instruct, WizardLM, and MT-Bench…
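The distinction the abstract draws between distance (how many layers a token traverses) and velocity (how quickly its residual state evolves per layer) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the scalar hidden state, the `step` and `tol` values, the `rate` multiplier, and the function name `layers_until_aligned` are all hypothetical. It only shows that a residual stream advancing at a higher rate can reach the final-layer state in fewer layers, which is what would allow an earlier exit.

```python
def layers_until_aligned(rate, num_layers=24, step=0.2, tol=0.05):
    """Count toy residual layers needed before the state is within tol of the target.

    Illustrative assumption only: a scalar state and a fixed `rate` multiplier,
    not the formulation used in the M2R2 paper.
    """
    target = 1.0   # stands in for the final-layer representation
    hidden = 0.0   # stands in for the token's input embedding
    for layer in range(1, num_layers + 1):
        delta = step * (target - hidden)   # toy residual transformation
        hidden = hidden + rate * delta     # rate scales how fast the residual evolves
        if abs(target - hidden) < tol:
            return layer                   # early exit: aligned after this many layers
    return num_layers                      # never aligned early; use the full depth


slow_exit = layers_until_aligned(rate=1.0)  # baseline residual stream
fast_exit = layers_until_aligned(rate=2.0)  # hypothetical accelerated stream
print(f"baseline exits after {slow_exit} layers, accelerated after {fast_exit}")
```

In this toy setting a distance-based method can only choose where along the baseline trajectory to stop, whereas changing the rate changes the trajectory itself.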
Taxonomy
Topics: Non-Destructive Testing Techniques · Image and Signal Denoising Methods · Fault Detection and Control Systems
