Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity

Mutian He; Philip N. Garner

arXiv:2410.06846 · cs.CL · March 14, 2025

TL;DR

This paper introduces CALD (Cross-Architecture Layerwise Distillation), a method that jointly converts a pretrained transformer into a linear-time model and fine-tunes it for a target task, improving efficiency while largely recovering the original model's performance.

Contribution

The paper proposes CALD, a novel joint conversion and fine-tuning approach for pretrained models to achieve linear complexity, applicable across speech and language domains.

Findings

01. CALD effectively recovers original model performance.

02. Guiding strategies improve fine-tuning outcomes.

03. Linear models can match transformer performance in various tasks.

Abstract

Architectures such as Linformer and Mamba have recently emerged as competitive linear time replacements for transformers. However, corresponding large pretrained models are often unavailable, especially in non-text domains. To remedy this, we present a Cross-Architecture Layerwise Distillation (CALD) approach that jointly converts a transformer model to a linear time substitute and fine-tunes it to a target task. We also compare several means to guide the fine-tuning to optimally retain the desired inference capability from the original model. The methods differ in their use of the target model and the trajectory of the parameters. In a series of empirical studies on language processing, language modeling, and speech processing, we show that CALD can effectively recover the result of the original model, and that the guiding strategy contributes to the result. Some reasons for the…
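The joint objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that layerwise distillation is measured as a mean squared error between per-layer hidden states of the transformer (teacher) and the linear-time model (student), and that this term is added to the task loss with a fixed weight. The names `mse`, `layerwise_distillation_loss`, `joint_loss`, and `distill_weight` are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def layerwise_distillation_loss(
    teacher_states: Sequence[Vector],
    student_states: Sequence[Vector],
) -> float:
    """Average the per-layer MSE between teacher and student hidden states.

    Assumption: both models expose one hidden-state vector per layer and
    have the same number of layers.
    """
    if len(teacher_states) != len(student_states):
        raise ValueError("teacher and student must have the same number of layers")
    per_layer = [mse(t, s) for t, s in zip(teacher_states, student_states)]
    return sum(per_layer) / len(per_layer)


def joint_loss(
    task_loss: float,
    teacher_states: Sequence[Vector],
    student_states: Sequence[Vector],
    distill_weight: float = 1.0,
) -> float:
    """Combine the target-task loss with the layerwise distillation term.

    Assumption: a simple weighted sum; the weighting scheme is illustrative.
    """
    distill = layerwise_distillation_loss(teacher_states, student_states)
    return task_loss + distill_weight * distill


# Two layers with two-dimensional hidden states (toy values).
teacher = [[1.0, 2.0], [0.0, 0.0]]
student = [[1.0, 0.0], [0.0, 2.0]]
total = joint_loss(task_loss=0.5, teacher_states=teacher,
                   student_states=student, distill_weight=0.5)
```

In a real setting the hidden states would be framework tensors taken from matching layers of the two models, and the guiding strategies compared in the paper would determine which teacher model supplies the targets.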

Code & Models

Repositories

idiap/linearize-distill-pretrained-transformers
PyTorch · Official

Videos

Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity · SlidesLive

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: Attention Is All You Need · Softmax · Layer Normalization · Dense Connections · Linear Layer · Multi-Head Linear Attention · Residual Connection · Linformer · Mamba: Linear-Time Sequence Modeling with Selective State Spaces