RAD: Redundancy-Aware Distillation for Hybrid Models via Self-Speculative Decoding

Yuichiro Hoshino; Hideyuki Tachibana; Muneyoshi Inahara; Hiroto Takegawa

arXiv:2505.22135 · cs.CL · May 29, 2025

Open Access

TL;DR

RAD improves hybrid Transformer-SSM models by identifying redundant attention layers and replacing them with SSM components, which yields better performance and up to 2x faster convergence in distillation.

Contribution

The paper presents RAD, a framework that uses self-speculative decoding as a diagnostic to identify redundant attention layers, replaces those layers with SSM components, and then applies targeted (self-)distillation to the replaced components.

Findings

01 RAD significantly improves performance on mathematical and coding tasks.

02 RAD achieves up to 2x faster convergence in distillation.

03 RAD outperforms baseline models even with smaller teachers.

Abstract

Hybrid models combining Transformers and State Space Models (SSMs) are promising for balancing performance and efficiency. However, optimizing these hybrid models, particularly by addressing the potential redundancy inherent within the Transformer components, remains a significant challenge. In this paper, we propose RAD (Redundancy-Aware Distillation), a novel framework that uses self-speculative decoding as a diagnostic tool to identify redundant attention layers within the model. These identified layers are then selectively replaced with SSM components, followed by targeted (self-)distillation. Specifically, RAD focuses knowledge transfer on the components identified as redundant, considering architectural changes and specific weight initialization strategies. We experimentally demonstrate that self-distillation using RAD significantly surpasses the performance of the original base…
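The abstract does not spell out how self-speculative decoding is turned into a redundancy score, so the following is only a toy sketch under a stated assumption: a layer is treated as more redundant when a draft pass that skips it still agrees with the full model's greedy prediction. The layer representation, the function names (`forward`, `acceptance_rate`, `rank_by_redundancy`), and the demo data are all hypothetical and are not taken from the paper.

```python
from typing import AbstractSet, Callable, List, Sequence, Tuple

Vector = List[float]
# Hypothetical layer type: maps a hidden state to a residual update.
Layer = Callable[[Vector], Vector]


def forward(layers: Sequence[Layer], x: Vector,
            skip: AbstractSet[int] = frozenset()) -> int:
    """Run a residual stack, optionally skipping layers, and return the argmax index."""
    h = list(x)
    for i, layer in enumerate(layers):
        if i in skip:
            continue  # a skipped layer contributes no residual update
        update = layer(h)
        h = [a + b for a, b in zip(h, update)]
    # Greedy "token": index of the largest entry of the final state.
    return max(range(len(h)), key=h.__getitem__)


def acceptance_rate(layers: Sequence[Layer], inputs: Sequence[Vector],
                    skip: AbstractSet[int]) -> float:
    """Fraction of inputs where the draft (layers skipped) matches the full model."""
    matches = sum(forward(layers, x, skip) == forward(layers, x) for x in inputs)
    return matches / len(inputs)


def rank_by_redundancy(layers: Sequence[Layer], inputs: Sequence[Vector],
                       candidates: Sequence[int]) -> List[Tuple[int, float]]:
    """Score each candidate layer by skipping it alone; highest agreement first."""
    scores = [(i, acceptance_rate(layers, inputs, {i})) for i in candidates]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Assumed demo data: layer 0 is a no-op, layer 1 negates the hidden state.
demo_layers: List[Layer] = [
    lambda h: [0.0 for _ in h],
    lambda h: [-2.0 * v for v in h],
]
demo_inputs: List[Vector] = [[1.0, 2.0, 3.0], [3.0, 1.0, 2.0], [0.0, 5.0, 1.0]]

ranking = rank_by_redundancy(demo_layers, demo_inputs, [0, 1])
print(ranking)
```

In this toy setup the no-op layer ranks first because skipping it never changes the prediction, which makes it the natural candidate for replacement. A real diagnostic would run on actual attention layers and token sequences, and the paper's scoring rule may differ from the agreement rate assumed here.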

Taxonomy

Topics: Speech Recognition and Synthesis · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis