TL;DR
This paper introduces an end-to-end approach for named entity recognition directly from English speech, demonstrating improved performance over traditional pipeline methods and providing a new dataset for the task.
Contribution
It presents the first publicly available NER-annotated dataset for English speech and an end-to-end model that jointly optimizes speech recognition and NER, outperforming the sequential two-step approach.
Findings
End-to-end model outperforms two-step pipeline.
New publicly available NER speech dataset introduced.
Potential to handle out-of-vocabulary words in ASR.
Abstract
Named entity recognition (NER) from text has been a widely studied problem and usually extracts semantic information from text. Until now, NER from speech has mostly been studied as a two-step pipeline that first applies an automatic speech recognition (ASR) system to an audio sample and then passes the predicted transcript to an NER tagger. In such cases, the error does not propagate from one step to the other, as the two tasks are not optimized in an end-to-end (E2E) fashion. Recent studies confirm that integrated approaches (e.g., E2E ASR) outperform sequential ones (e.g., phoneme-based ASR). In this paper, we introduce the first publicly available NER-annotated dataset for English speech and present an E2E approach, which jointly optimizes the ASR and NER tagger components. Experimental results show that the proposed E2E approach outperforms the classical two-step approach.…
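One way to picture the joint formulation is a single output sequence that carries both the words and the entity markup, so that one decoding pass yields the transcript and the entities together. The sketch below is an illustration only: the bracket tag syntax (`[PER ...]`), the label names, the sample sentence, and the `parse_tagged_transcript` helper are all assumptions made for this example and are not taken from the paper.

```python
import re
from typing import List, Tuple

# Hypothetical inline markup "[LABEL entity words]"; assumed for illustration,
# not the format used in the paper.
TAG_PATTERN = re.compile(r"\[([A-Z]+) ([^\]]+)\]")


def parse_tagged_transcript(tagged: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Split a tagged transcript into plain text and (entity, label) pairs."""
    # Collect every tagged span as (entity text, label).
    entities = [(m.group(2), m.group(1)) for m in TAG_PATTERN.finditer(tagged)]
    # Strip the markup, keeping only the entity words in the transcript.
    plain = TAG_PATTERN.sub(lambda m: m.group(2), tagged)
    return plain, entities


# Made-up example of what a jointly trained model might emit.
tagged_output = "[PER barack obama] visited [LOC paris] last week"
text, ents = parse_tagged_transcript(tagged_output)
print(text)
print(ents)
```

In a two-step pipeline, by contrast, the tagger would only ever see the plain transcript produced by the ASR stage, so a misrecognized name could not be recovered downstream.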
