Target word activity detector: An approach to obtain ASR word boundaries without lexicon
Sunit Sivasankaran, Eric Sun, Jinyu Li, Yan Huang, Jing Pan

TL;DR
This paper introduces a novel method for estimating word boundaries in end-to-end multilingual ASR models without using lexicons, leveraging word embeddings and a pretrained model to improve scalability and reduce costs.
Contribution
The proposed approach estimates word boundaries without lexicons. It requires only word alignment information during training and scales to multiple languages, addressing the scalability and computational-cost limitations of existing methods.
Findings
Validated on a multilingual ASR model trained on five languages
Effective when compared against a strong baseline
Scales to additional languages without incurring additional cost
Abstract
Obtaining word timestamp information from end-to-end (E2E) ASR models remains challenging due to the lack of explicit time alignment during training. This issue is further complicated in multilingual models. Existing methods either rely on lexicons or introduce additional tokens, leading to scalability issues and increased computational costs. In this work, we propose a new approach to estimate word boundaries without relying on lexicons. Our method leverages word embeddings from sub-word token units and a pretrained ASR model, requiring only word alignment information during training. Our proposed method can scale up to any number of languages without incurring any additional cost. We validate our approach using a multilingual ASR model trained on five languages and demonstrate its effectiveness against a strong baseline.
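The abstract does not spell out how word embeddings are built from sub-word tokens or how boundaries are read off the detector, so the following is only a rough sketch under stated assumptions, not the authors' implementation. It assumes SentencePiece-style tokens where a leading `▁` marks the start of a word, mean pooling of token vectors within a word, and a per-frame activity score for the target word that is thresholded to obtain start and end frames. The function names, the pooling choice, and the 0.5 threshold are all hypothetical.

```python
from typing import List, Optional, Tuple

# Assumption: SentencePiece-style marker for the first sub-word of a word.
WORD_START = "\u2581"


def group_subwords(tokens: List[str]) -> List[List[int]]:
    """Return, for each word, the indices of the sub-word tokens that form it."""
    groups: List[List[int]] = []
    for i, tok in enumerate(tokens):
        # Start a new word at a marked token, or at the very first token.
        if tok.startswith(WORD_START) or not groups:
            groups.append([i])
        else:
            groups[-1].append(i)
    return groups


def word_embeddings(tokens: List[str],
                    token_vecs: List[List[float]]) -> List[List[float]]:
    """Mean-pool token vectors within each word (pooling is an assumption)."""
    pooled: List[List[float]] = []
    for idx in group_subwords(tokens):
        dim = len(token_vecs[idx[0]])
        pooled.append(
            [sum(token_vecs[i][d] for i in idx) / len(idx) for d in range(dim)]
        )
    return pooled


def activity_to_boundary(probs: List[float],
                         threshold: float = 0.5) -> Optional[Tuple[int, int]]:
    """Return (first, last) frame index whose activity score is >= threshold.

    Returns None when no frame reaches the threshold. The threshold value
    is a hypothetical default, not taken from the paper.
    """
    active = [t for t, p in enumerate(probs) if p >= threshold]
    if not active:
        return None
    return active[0], active[-1]
```

For example, the tokens `▁hel`, `lo`, `▁world` would be grouped into two words, and a frame-level activity sequence for one of those words would be reduced to a single start and end frame. Multiplying the frame indices by the encoder's frame shift would give timestamps in seconds.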
Taxonomy
Topics: Natural Language Processing Techniques
