Multimodal Representation Loss Between Timed Text and Audio for Regularized Speech Separation
Tsun-An Hsieh, Heeyoul Choi, Minje Kim

TL;DR
This paper introduces a timed text regularization method that leverages language model semantics to enhance speech separation, aligning audio with timed text without needing auxiliary text data during testing.
Contribution
The paper proposes a novel regularization approach using pretrained WavLM and BERT models to improve speech separation by aligning audio and text embeddings.
Findings
Timed text regularization (TTR) improves separation metrics over unregularized models
The method effectively aligns audio sources with timed text semantics
Experimental results demonstrate consistent performance gains
Abstract
Recent studies highlight the potential of textual modalities in conditioning the speech separation model's inference process. However, regularization-based methods remain underexplored despite their advantage of not requiring auxiliary text data at test time. To address this gap, we introduce a timed text-based regularization (TTR) method that uses language model-derived semantics to improve speech separation models. Our approach involves two steps. We begin with two pretrained audio and language models, WavLM and BERT, respectively. Then, a Transformer-based audio summarizer is learned to align the audio and word embeddings and to minimize their gap. The summarizer Transformer, incorporated as a regularizer, promotes the separated sources' alignment with the semantics from the timed text. Experimental results show that the proposed TTR method consistently improves the various…
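The regularization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the learned Transformer summarizer is replaced here by plain mean pooling over each word's frame span, the audio-text gap is measured with cosine distance, and the names mean_pool, ttr_penalty, regularized_loss, and the weight parameter are all hypothetical. The abstract does not state the actual distance function or loss weighting.

```python
import math

def mean_pool(frames):
    """Stand-in for the learned summarizer (assumption): average the
    frame embeddings that fall inside one word's time span."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def ttr_penalty(frame_embeddings, word_spans, word_embeddings):
    """Average audio-text gap over all timed words.

    frame_embeddings: per-frame audio embeddings of one separated source
    word_spans:       (start_frame, end_frame) per word, end exclusive
    word_embeddings:  one text embedding per word, same dimension
    """
    gaps = []
    for (start, end), word_vec in zip(word_spans, word_embeddings):
        summary = mean_pool(frame_embeddings[start:end])
        gaps.append(cosine_distance(summary, word_vec))
    return sum(gaps) / len(gaps)

def regularized_loss(separation_loss, penalty, weight=0.1):
    """Training objective: separation loss plus a weighted penalty.
    The weight value is an illustrative assumption."""
    return separation_loss + weight * penalty
```

Because the penalty is only added to the training objective, the separation model itself takes no text input, which is consistent with the abstract's point that no auxiliary text is needed at test time.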
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · WordPiece · Residual Connection · Softmax · ALIGN · Layer Normalization
