End-to-end Music-mixed Speech Recognition

Jeongwoo Woo; Masato Mimura; Kazuyoshi Yoshii; Tatsuya Kawahara

arXiv:2008.12048 · eess.AS · August 28, 2020 · APSIPA

TL;DR

This paper introduces an end-to-end approach for improving speech recognition in multimedia content by using time-domain source separation with Conv-TasNet, significantly reducing word error rates across various music genres.

Contribution

The paper proposes joint fine-tuning of a Conv-TasNet front-end with an attention-based ASR back-end, which outperforms frequency-domain separation on music-mixed speech recognition.

Findings

01 Time-domain separation substantially improves ASR performance on music-mixed speech.

02 Joint optimization of the separation front-end and the ASR back-end further reduces word error rate.

03 The method is robust across diverse music genres.
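The findings are stated in terms of word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration. The paper evaluates on Japanese speech, where a different tokenization would be needed, and this is not the authors' scoring script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.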

Abstract

Automatic speech recognition (ASR) for multimedia content is a promising application, but speech in such content is frequently mixed with background music, which degrades ASR performance. In this study, we propose a method for improving ASR in the presence of background music based on time-domain source separation. We use Conv-TasNet, which has achieved state-of-the-art performance for multi-speaker source separation, as the separation network to extract the speech signal from a speech-music mixture in the waveform domain. We also propose joint fine-tuning of a pre-trained Conv-TasNet front-end with an attention-based ASR back-end using both separation and ASR objectives. We evaluated our method through ASR experiments using speech data mixed with background music from a wide variety of Japanese animations. We show that time-domain speech-music separation…
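The abstract says the front-end and back-end are fine-tuned "using both separation and ASR objectives" but, as excerpted here, does not say which separation loss is used or how the two terms are combined. The sketch below shows one plausible shape under stated assumptions: the separation term is the negative scale-invariant SNR (SI-SNR), which is the objective used in the original Conv-TasNet work, and the two terms are mixed by a hypothetical weight `sep_weight`. The names `si_snr`, `joint_loss`, and `sep_weight` are illustrative. Pure Python is used so the arithmetic is easy to follow, whereas a real training loop would use a tensor library.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length waveforms."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)

    # Remove the mean so that a DC offset does not affect the measure.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]

    # Project the estimate onto the target; the residual counts as noise.
    dot = sum(e * t for e, t in zip(est, tgt))
    scale = dot / (sum(t * t for t in tgt) + eps)
    projection = [scale * t for t in tgt]
    noise = [e - p for e, p in zip(est, projection)]

    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


def joint_loss(estimate, target, asr_loss, sep_weight=0.5):
    """Hypothetical weighted sum of a separation loss and an ASR loss.

    Assumption: the weighting scheme is illustrative, not taken from the paper.
    """
    separation_loss = -si_snr(estimate, target)  # higher SI-SNR gives lower loss
    return sep_weight * separation_loss + (1.0 - sep_weight) * asr_loss


if __name__ == "__main__":
    clean = [0.0, 1.0, 0.0, -1.0]
    separated = [0.1, 0.9, -0.1, -1.1]
    print(si_snr(separated, clean))
    print(joint_loss(separated, clean, asr_loss=2.0))
```

Setting `sep_weight` to 0.0 trains on the ASR objective alone, and 1.0 trains on separation alone. Intermediate values trade the two off.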

Code & Models

Repositories

Kyoto-University-Speech-and-Audio/woo-music-mixed-speech-recognition
pytorch

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis