Multimodal Representation Loss Between Timed Text and Audio for Regularized Speech Separation

Tsun-An Hsieh; Heeyoul Choi; Minje Kim

arXiv:2406.08328 · eess.AS · August 6, 2025 · Interspeech

TL;DR

This paper introduces timed text-based regularization (TTR), a training-time method that uses language model-derived semantics to improve speech separation by aligning separated audio with timed text. No auxiliary text data is needed at test time.

Contribution

The paper proposes a regularization approach that starts from pretrained WavLM and BERT models, learns a Transformer-based audio summarizer to align audio and word embeddings, and uses that summarizer as a regularizer for speech separation.

Findings

01. TTR improves separation metrics over unregularized models.

02. The method aligns separated audio sources with the semantics of the timed text.

03. Experimental results show consistent performance gains.

Abstract

Recent studies highlight the potential of textual modalities in conditioning the speech separation model's inference process. However, regularization-based methods remain underexplored despite their advantage of not requiring auxiliary text data at test time. To address this gap, we introduce a timed text-based regularization (TTR) method that uses language model-derived semantics to improve speech separation models. Our approach involves two steps. We begin with two pretrained audio and language models, WavLM and BERT, respectively. Then, a Transformer-based audio summarizer is learned to align the audio and word embeddings and to minimize their gap. The summarizer Transformer, incorporated as a regularizer, promotes the separated sources' alignment with the semantics from the timed text. Experimental results show that the proposed TTR method consistently improves the various…
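
The regularization idea in the abstract can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration, not taken from the paper: the names `alignment_gap` and `regularized_loss`, the `weight` parameter, and the use of cosine distance as the measure of the "gap" between embeddings. Plain Python lists stand in for the word-level vectors that the audio summarizer and BERT would produce.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_gap(audio_summaries, word_embeddings):
    """Mean cosine distance over paired word-level vectors.

    Hypothetical stand-in for the audio/word embedding gap; the paper's
    actual gap measure is not stated in this listing.
    """
    if len(audio_summaries) != len(word_embeddings):
        raise ValueError("expected one audio summary per word embedding")
    gaps = [
        1.0 - cosine_similarity(a, w)
        for a, w in zip(audio_summaries, word_embeddings)
    ]
    return sum(gaps) / len(gaps)


def regularized_loss(separation_loss, audio_summaries, word_embeddings, weight=0.1):
    """Separation loss plus a weighted alignment penalty.

    `weight` is an assumed hyperparameter, not a value from the paper.
    """
    return separation_loss + weight * alignment_gap(audio_summaries, word_embeddings)
```

Because the penalty only enters the training objective, a model trained this way would still take audio alone as input at test time, which matches the advantage the abstract attributes to regularization-based methods.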

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · WordPiece · Residual Connection · Softmax · ALIGN · Layer Normalization