Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models

Aviv Bick; Kevin Y. Li; Eric P. Xing; J. Zico Kolter; Albert Gu

arXiv:2408.10189·cs.LG·February 11, 2025·3 cites

TL;DR

This paper introduces MOHAWK, a distillation method that converts pretrained Transformer models into subquadratic architectures such as SSMs. A distilled Mamba-2 variant achieves strong performance using only 3B training tokens.

Contribution

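The authors propose a staged distillation approach that views both Transformers and SSMs as applying mixing matrices over the token sequence and matches them at successive levels of granularity, starting with the mixing matrices themselves and then the hidden units at each block. This lets subquadratic models benefit from the compute already invested in large-scale Transformer training.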

Findings

01. Successfully distilled a Mamba-2 variant using only 3B tokens
02. Achieved superior performance compared to previous non-Transformer models
03. Demonstrated that SSMs can benefit from Transformer training resources

Abstract

Transformer architectures have become a dominant paradigm for domains like language modeling but suffer in many inference settings due to their quadratic-time self-attention. Recently proposed subquadratic architectures, such as Mamba, have shown promise, but have been pretrained with substantially less computational resources than the strongest Transformer models. In this work, we present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs). The key idea to our approach is that we can view both Transformers and SSMs as applying different forms of mixing matrices over the token sequences. We can thus progressively distill the Transformer architecture by matching different degrees of granularity in the SSM: first matching the mixing matrices themselves, then the hidden units at each block, and finally the…
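The "mixing matrix" view in the abstract can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the authors' implementation: it builds a causal softmax attention matrix, materializes a simple scalar-decay linear recurrence as a lower-triangular matrix, and measures how far apart the two are. The function names, the scalar-decay recurrence, and the choice of Frobenius distance as the matching objective are all assumptions made here for illustration.

```python
import math


def attention_mixing_matrix(Q, K):
    """Causal softmax attention: row i holds the weights over tokens j <= i."""
    T, d = len(Q), len(Q[0])
    M = [[0.0] * T for _ in range(T)]
    for i in range(T):
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        for j in range(i + 1):
            M[i][j] = exps[j] / total
    return M


def ssm_mixing_matrix(a, B, C):
    """Materialize the recurrence h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t
    as a matrix with M[i][j] = (C_i . B_j) * a_{j+1} * ... * a_i for j <= i.
    (Hypothetical simplified SSM with a scalar decay per step.)"""
    T = len(a)
    M = [[0.0] * T for _ in range(T)]
    for i in range(T):
        decay = 1.0
        for j in range(i, -1, -1):
            M[i][j] = decay * sum(c * b for c, b in zip(C[i], B[j]))
            decay *= a[j]
    return M


def frobenius_distance(M1, M2):
    """Square root of the summed squared entrywise differences."""
    return math.sqrt(
        sum((x - y) ** 2 for r1, r2 in zip(M1, M2) for x, y in zip(r1, r2))
    )
```

In a distillation setting, a distance of this kind between a teacher attention matrix and a student SSM matrix could serve as the first-stage loss, with the student parameters (`a`, `B`, `C` here) as the trainable quantities. Materializing the full matrix costs quadratic time and is only meaningful for training or analysis; inference with the recurrence itself stays linear in sequence length.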

Code & Models

Repositories

yynil/rwkvinllama
pytorch

Videos

Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models · slideslive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Reservoir Engineering and Simulation Methods

Methods: Linear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax