TL;DR
This paper introduces a multitask training method for end-to-end spoken term detection that leverages unpaired text to improve search accuracy and facilitate domain adaptation, addressing limitations of traditional ASR-based systems.
Contribution
It proposes a novel multitask training objective that incorporates unpaired text into end-to-end keyword search (KWS), improving performance without complicating indexing or search.
Findings
Significant improvements in search performance across multiple languages.
Enhanced document representations for words in unpaired text.
Effective domain adaptation with scarce in-domain data.
Abstract
End-to-end (E2E) approaches to keyword search (KWS) are considerably simpler in terms of training and indexing complexity when compared to approaches that use the output of automatic speech recognition (ASR) systems. This simplification, however, has drawbacks due to the loss of modularity. In particular, whereas ASR-based KWS systems can benefit from external unpaired text via a language model, current formulations of E2E KWS systems have no such mechanism. Therefore, in this paper, we propose a multitask training objective which allows unpaired text to be integrated into E2E KWS without complicating indexing and search. In addition to training an E2E KWS model to retrieve text queries from spoken documents, we jointly train it to retrieve text queries from masked written documents. We show empirically that this approach can effectively leverage unpaired text for KWS, with significant…
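The joint objective described in the abstract can be sketched as a weighted sum of two retrieval losses: one for finding text queries in spoken documents and one for finding them in masked written documents. This is a minimal illustration under stated assumptions, not the authors' implementation. The scorer callables, the per-token masking rule controlled by `mask_prob`, the binary cross-entropy loss, the `text_weight` interpolation weight, and the choice to derive written-document labels from the unmasked text are all hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: a query is a list of word tokens, a spoken document is a
# sequence of acoustic feature frames, and a written document is a token list.
Query = List[str]
SpeechScorer = Callable[[Query, Sequence[Sequence[float]]], float]
TextScorer = Callable[[Query, List[str]], float]


def bce(prob: float, label: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one predicted occurrence probability."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def contains(tokens: Sequence[str], query: Query) -> int:
    """Return 1 if the query occurs as a contiguous token span, else 0."""
    n = len(query)
    return int(any(list(tokens[i:i + n]) == list(query)
                   for i in range(len(tokens) - n + 1)))


def mask_tokens(tokens: Sequence[str], mask_prob: float, rng: random.Random,
                mask_symbol: str = "<mask>") -> List[str]:
    """Assumed masking rule: replace each token independently with probability mask_prob."""
    return [mask_symbol if rng.random() < mask_prob else t for t in tokens]


def multitask_loss(score_speech: SpeechScorer,
                   score_text: TextScorer,
                   spoken_batch: Sequence[Tuple[Query, Sequence[Sequence[float]], int]],
                   written_batch: Sequence[Tuple[Query, List[str]]],
                   text_weight: float,
                   mask_prob: float,
                   rng: random.Random) -> float:
    # Task 1: retrieve text queries from spoken documents
    # (occurrence labels assumed to come from paired transcripts).
    speech_loss = sum(bce(score_speech(q, feats), y)
                      for q, feats, y in spoken_batch) / len(spoken_batch)

    # Task 2: retrieve text queries from masked written documents.
    # Labels are computed from the unmasked text, so unpaired text needs no annotation.
    text_terms = []
    for q, doc in written_batch:
        label = contains(doc, q)
        masked_doc = mask_tokens(doc, mask_prob, rng)
        text_terms.append(bce(score_text(q, masked_doc), label))
    text_loss = sum(text_terms) / len(text_terms)

    # Weighted sum of the two task losses.
    return speech_loss + text_weight * text_loss


# Usage with constant dummy scorers (every probability is 0.5, so each BCE term
# is ln 2 and the total is ln 2 + 0.5 * ln 2 = 1.5 * ln 2).
spoken = [(["hello"], [[0.1, 0.2]], 1), (["world"], [[0.3, 0.4]], 0)]
written = [(["new", "york"], ["i", "love", "new", "york"]),
           (["paris"], ["i", "love", "rome"])]
example_loss = multitask_loss(lambda q, f: 0.5, lambda q, d: 0.5,
                              spoken, written, text_weight=0.5,
                              mask_prob=0.15, rng=random.Random(0))
print(example_loss)
```

Because the written-document task shares the query side with the spoken-document task in this sketch, text-only data can shape the model without changing how spoken documents are indexed or searched. The masking step keeps the text task from being solved trivially by exact string matching.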
