Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization
Tzu-Quan Lin, Wei-Ping Huang, Hao Tang, Hung-yi Lee

TL;DR
Speech-FT is a two-stage fine-tuning framework that enhances cross-task generalization of speech models by reducing representational drift and restoring pre-trained features, leading to improved performance across tasks.
Contribution
The paper introduces a novel two-stage fine-tuning method that combines representational drift reduction with weight-space interpolation to maintain cross-task generalization.
Findings
Speech-FT improves performance on HuBERT, wav2vec 2.0, DeCoAR 2.0, and WavLM Base+.
It achieves better cross-task generalization than regularization-based fine-tuning methods.
It significantly reduces error rates on the SUPERB benchmark, e.g., phone error rate drops from 5.17% to 3.94%.
Abstract
Fine-tuning speech representation models can enhance performance on specific tasks but often compromises their cross-task generalization ability. This degradation is frequently caused by excessive changes in the representations, which make it difficult to retain information learned during pre-training. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model, and may therefore lose cross-task generalization. To address this issue, we propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning. Speech-FT first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. Extensive experiments…
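The weight-space interpolation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the interpolation formula or coefficient, so everything here is an assumption: the linear form (1 - alpha) * pre-trained + alpha * fine-tuned, the name `alpha`, the helper `interpolate_weights`, and the toy parameter names are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def interpolate_weights(pretrained: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Assumed linear interpolation: (1 - alpha) * pretrained + alpha * finetuned.

    alpha = 0 returns the pre-trained weights, alpha = 1 the fine-tuned ones.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")

    merged: Weights = {}
    for name, pre in pretrained.items():
        ft = finetuned[name]
        if len(pre) != len(ft):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise blend of the two parameter vectors.
        merged[name] = [(1.0 - alpha) * p + alpha * f for p, f in zip(pre, ft)]
    return merged


# Hypothetical two-parameter "model" used only for illustration.
pretrained = {"encoder.weight": [0.0, 1.0], "encoder.bias": [2.0]}
finetuned = {"encoder.weight": [1.0, 3.0], "encoder.bias": [4.0]}
merged = interpolate_weights(pretrained, finetuned, alpha=0.25)
```

With a small `alpha`, the merged weights stay close to the pre-trained model, which matches the abstract's stated goal of restoring cross-task generalization. The value 0.25 is arbitrary and chosen only to keep the arithmetic easy to follow.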
