TL;DR
This paper introduces the first Vietnamese speech dataset for NER, a pre-trained monolingual language model that sets a new state of the art on Vietnamese NER, and a pipeline that recovers capitalization and punctuation to improve NER from speech input.
Contribution
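It presents the first Vietnamese speech dataset for NER, the first public pre-trained large-scale monolingual language model for Vietnamese, and a novel pipeline that uses a capitalization and punctuation (CaPu) recovery model to restore text formatting and improve NER from speech.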
Findings
CaPu model improves NER F1-score by nearly 4%
Language model achieves a 1.3% absolute F1 score improvement over the latest prior study on Vietnamese NER
First Vietnamese speech dataset for NER
Abstract
Studies on the Named Entity Recognition (NER) task have shown outstanding results that reach human parity on input texts with correct text formatting, such as proper punctuation and capitalization. However, such conditions are not available in applications where the input is speech, because the text is generated by an automatic speech recognition (ASR) system that does not consider text formatting. In this paper, we (1) present the first Vietnamese speech dataset for the NER task, and (2) the first public pre-trained large-scale monolingual language model for Vietnamese, which achieves a new state of the art for the Vietnamese NER task by 1.3% absolute F1 score compared to the latest study. Finally, (3) we propose a new pipeline for the NER task from speech that overcomes the text formatting problem by introducing a text capitalization and punctuation recovery model…
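The pipeline idea in contribution (3), placing a formatting-recovery step between ASR output and the NER model, can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in and not the paper's models: `toy_capu` is a rule-based lookup (the lexicon `PROPER_NOUNS` is invented for illustration) in place of a learned capitalization and punctuation model, and `toy_ner` is a naive capitalization heuristic in place of a trained tagger. The only point is to show why unformatted ASR text hurts a case-sensitive NER step and how the stages compose.

```python
from typing import Callable, List

# Hypothetical toy lexicon standing in for a learned CaPu model.
PROPER_NOUNS = {"hà nội": "Hà Nội"}


def toy_capu(asr_text: str) -> str:
    """Rule-based stand-in: restore capitalization and a final period."""
    text = asr_text.strip()
    for lower, proper in PROPER_NOUNS.items():
        text = text.replace(lower, proper)
    if text:
        text = text[0].upper() + text[1:]
        if text[-1] not in ".?!":
            text += "."
    return text


def toy_ner(text: str) -> List[str]:
    """Naive stand-in: runs of capitalized, non-sentence-initial tokens are entities."""
    tokens = text.rstrip(".?!").split()
    entities: List[str] = []
    current: List[str] = []
    for i, tok in enumerate(tokens):
        if i > 0 and tok[0].isupper():
            current.append(tok)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


def ner_from_speech(
    asr_text: str,
    capu: Callable[[str], str],
    ner: Callable[[str], List[str]],
) -> List[str]:
    """Compose the stages: ASR text -> formatting recovery -> NER."""
    return ner(capu(asr_text))


if __name__ == "__main__":
    asr_output = "tôi sống ở hà nội"  # lowercase, unpunctuated, as ASR typically emits
    print(toy_ner(asr_output))                             # NER on raw ASR text
    print(ner_from_speech(asr_output, toy_capu, toy_ner))  # NER after CaPu
```

In this toy setup the heuristic tagger finds no entity in the raw transcript but recovers "Hà Nội" once formatting is restored. A real system would replace both stand-ins with trained models.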
