Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization

Tzu-Quan Lin; Wei-Ping Huang; Hao Tang; Hung-yi Lee

arXiv:2502.12672·cs.CL·April 28, 2026

TL;DR

Speech-FT is a two-stage fine-tuning framework for speech representation models. It first fine-tunes in a way that limits representational drift, then interpolates the fine-tuned weights with the pre-trained weights to restore cross-task generalization while keeping the gains from fine-tuning.

Contribution

The paper introduces a two-stage fine-tuning method that combines drift-reducing fine-tuning with weight-space interpolation against the pre-trained model, so that the fine-tuned model retains cross-task generalization.
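
As a rough illustration of what weight-space interpolation between a pre-trained and a fine-tuned model can look like, the sketch below linearly blends two sets of parameters. It is not taken from the paper: the function name `interpolate_weights`, the coefficient `alpha`, the convention that `alpha = 0` recovers the pre-trained model, and the toy parameter dictionaries are all assumptions made for illustration. This page does not state the paper's actual interpolation coefficient or formula.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def interpolate_weights(pretrained: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * pretrained + alpha * finetuned for every parameter.

    alpha is a hypothetical mixing coefficient: 0.0 gives the pre-trained
    weights back, 1.0 gives the fine-tuned weights back.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")

    merged: Weights = {}
    for name, pre in pretrained.items():
        ft = finetuned[name]
        if len(pre) != len(ft):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise linear blend of the two parameter vectors.
        merged[name] = [(1.0 - alpha) * p + alpha * f for p, f in zip(pre, ft)]
    return merged


# Made-up values, chosen only to make the blend easy to check by hand.
pretrained = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
finetuned = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}
merged = interpolate_weights(pretrained, finetuned, alpha=0.25)
print(merged)
```

With real checkpoints, the same blend would be applied tensor by tensor to two models that share an architecture. Smaller values of the assumed `alpha` keep the merged model closer to the pre-trained weights.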

Findings

01

Speech-FT improves performance on HuBERT, wav2vec 2.0, DeCoAR 2.0, and WavLM Base+.

02

It achieves better cross-task generalization than regularization-based fine-tuning methods.

03

It significantly reduces error rates on the SUPERB benchmark; for example, phone error rate drops from 5.17% to 3.94%.

Abstract

Fine-tuning speech representation models can enhance performance on specific tasks but often compromises their cross-task generalization ability. This degradation is often caused by excessive changes in the representations, making it difficult to retain information learned during pre-training. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model, and thus could possibly lose cross-task generalization. To address this issue, we propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning. Speech-FT first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. Extensive experiments…
