Position-invariant Fine-tuning of Speech Enhancement Models with Self-supervised Speech Representations

Amit Meghanani; Thomas Hain

arXiv:2601.21084 · cs.CL · January 30, 2026

Open Access

TL;DR

This paper addresses the challenge of position dependence in self-supervised speech representations during speech enhancement fine-tuning, proposing position-invariant methods that improve convergence and performance.

Contribution

It introduces position-invariant fine-tuning strategies for SSL-based speech enhancement, notably using soft-DTW loss and zero-padding, to mitigate positional bias.

Findings

01. Soft-DTW-based fine-tuning converges faster.

02. Position-invariant methods improve downstream speech enhancement performance.

03. Zero-padding alone is less effective than soft-DTW in this context.

Abstract

Integrating front-end speech enhancement (SE) models with self-supervised learning (SSL)-based speech models is effective for downstream tasks in noisy conditions. SE models are commonly fine-tuned using SSL representations with mean squared error (MSE) loss between enhanced and clean speech. However, MSE is prone to exploiting positional embeddings in SSL models, allowing the objective to be minimised through positional correlations instead of content-related information. This work frames the problem as a general limitation of self-supervised representation fine-tuning and investigates it through representation-guided SE. Two strategies are considered: (1) zero-padding, previously explored in SSL pre-training but here examined in the fine-tuning setting, and (2) speed perturbations with a soft-DTW loss. Experiments show that the soft-DTW-based approach achieves faster convergence and…
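To make the contrast between the two objectives concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes representations are arrays of shape (frames, dims) and uses a squared Euclidean frame cost. The function names `frame_mse` and `soft_dtw` and the smoothing parameter `gamma` are illustrative choices. Frame-wise MSE compares frame i with frame i, so it ties content to position. Soft-DTW instead takes a smooth minimum over all monotonic alignments, so it still scores a time-warped copy (a stand-in for speed perturbation) as close.

```python
import numpy as np


def frame_mse(a, b):
    """Frame-by-frame MSE; requires identical shapes, so frame i is tied to frame i."""
    if a.shape != b.shape:
        raise ValueError("frame-wise MSE needs sequences of equal length")
    return float(np.mean((a - b) ** 2))


def softmin(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW value between x of shape (n, d) and y of shape (m, d)."""
    n, m = len(x), len(y)
    # Pairwise squared Euclidean cost between every frame of x and every frame of y.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    # r[i, j] holds the soft-minimum alignment cost of x[:i] against y[:j].
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]]
            r[i, j] = cost[i - 1, j - 1] + softmin(prev, gamma)
    return float(r[n, m])


if __name__ == "__main__":
    clean = np.arange(15, dtype=float).reshape(5, 3)
    # Repeat every frame twice: same content, different timing and length.
    slowed = np.repeat(clean, 2, axis=0)
    print("soft-DTW, warped copy:", soft_dtw(clean, slowed, gamma=0.01))
    print("soft-DTW, shifted values:", soft_dtw(clean, clean + 100.0, gamma=0.01))
```

Note that `frame_mse(clean, slowed)` would raise here because the lengths differ, whereas soft-DTW handles the length mismatch directly. Using this as a training loss would require a differentiable implementation in an autodiff framework; the loop above only evaluates the value.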

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis