Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models

Thai Binh Nguyen; Quang Minh Nguyen; Thi Thu Hien Nguyen; Quoc Truong Do; Chi Mai Luong

arXiv:2010.00198·cs.CL·October 2, 2020

TL;DR

This paper introduces a Vietnamese speech dataset and a new pipeline with capitalization and punctuation recovery to enhance NER performance from speech input, achieving state-of-the-art results.

Contribution

It presents the first Vietnamese speech dataset for NER, a large-scale monolingual language model for Vietnamese, and a novel pipeline that uses a capitalization and punctuation recovery (CaPu) model to restore text formatting and improve NER from speech.

Findings

01

CaPu model improves NER F1-score by nearly 4%

02

Pre-trained language model achieves a 1.3% absolute F1 score improvement over the latest prior study on Vietnamese NER

03

First Vietnamese speech dataset for NER

Abstract

Studies on the Named Entity Recognition (NER) task have shown outstanding results that reach human parity on input text with correct formatting, such as proper punctuation and capitalization. However, such conditions are not available in applications where the input is speech, because the text is generated by an automatic speech recognition (ASR) system that does not consider text formatting. In this paper, we present (1) the first Vietnamese speech dataset for the NER task and (2) the first public pre-trained large-scale monolingual language model for Vietnamese, which achieves a new state of the art on the Vietnamese NER task, improving on the latest study by 1.3% absolute F1 score. Finally, (3) we propose a new pipeline for NER from speech that overcomes the text formatting problem by introducing a text capitalization and punctuation recovery model…
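The three-stage idea in the abstract (ASR output, then capitalization and punctuation recovery, then NER) can be illustrated with a minimal Python sketch. Everything here is a hypothetical stand-in and not the authors' implementation: `TOY_CAPU_LEXICON`, `toy_capu`, `toy_ner`, and `ner_from_speech` are illustrative names, the lookup table replaces a trained CaPu model, and the capitalization heuristic replaces a trained NER tagger. The sketch only shows why a tagger that relies on formatting cues finds nothing in raw ASR text but can recover an entity once formatting is restored.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup table standing in for a trained CaPu model.
TOY_CAPU_LEXICON: Dict[str, str] = {
    "hà": "Hà",
    "nội": "Nội",
    "việt": "Việt",
    "nam": "Nam",
}

# Example ASR-style output: lowercase, no punctuation.
EXAMPLE = "tôi sống ở hà nội"


def toy_capu(asr_text: str) -> str:
    """Restore capitalization and a final period (toy stand-in for CaPu)."""
    tokens = [TOY_CAPU_LEXICON.get(tok, tok) for tok in asr_text.split()]
    if not tokens:
        return ""
    # Capitalize the sentence-initial token.
    tokens[0] = tokens[0][:1].upper() + tokens[0][1:]
    return " ".join(tokens) + "."


def toy_ner(text: str) -> List[Tuple[str, str]]:
    """Tag runs of capitalized, non-sentence-initial tokens as entities.

    This is a toy heuristic, used only because it depends on formatting
    in the same way a formatting-sensitive NER model might.
    """
    tokens = text.rstrip(".").split()
    entities: List[Tuple[str, str]] = []
    run: List[str] = []
    for i, tok in enumerate(tokens):
        if i > 0 and tok[:1].isupper():
            run.append(tok)
        elif run:
            entities.append((" ".join(run), "ENT"))
            run = []
    if run:
        entities.append((" ".join(run), "ENT"))
    return entities


def ner_from_speech(asr_text: str) -> List[Tuple[str, str]]:
    """Pipeline sketch: ASR text -> formatting recovery -> NER."""
    return toy_ner(toy_capu(asr_text))


if __name__ == "__main__":
    print(toy_ner(EXAMPLE))          # NER directly on raw ASR text
    print(ner_from_speech(EXAMPLE))  # NER after formatting recovery
```

In this toy setting the direct call returns no entities, while the pipeline call returns the place name, which mirrors the motivation for placing a recovery step before NER.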

Code & Models

Repositories

nguyenvulebinh/vietnamese-roberta
PyTorch · Official
