Punctuation Restoration for Singaporean Spoken Languages: English, Malay, and Mandarin
Abhinav Rao, Ho Thi-Nga, Chng Eng-Siong

TL;DR
This paper introduces a slot-filling approach to punctuation restoration for multilingual ASR transcripts, covering three languages widely spoken in Singapore. It reports state-of-the-art results for Mandarin and strong performance for English and Malay.
Contribution
To the authors' knowledge, it is the first system to restore punctuation for English, Mandarin, and Malay simultaneously, using a masked punctuation prediction model and improved tokenization.
Findings
Achieved 73.8% F1-score for Mandarin punctuation restoration.
State-of-the-art performance for Mandarin on the IWSLT2022 dataset.
Using the Jieba tokenizer, rather than only XLM-R's built-in SentencePiece tokenizer, improved Mandarin results.
Abstract
This paper presents work on restoring punctuation in transcripts generated by multilingual ASR systems. The focus languages are English, Mandarin, and Malay, which are three of the most popular languages in Singapore. To the best of our knowledge, this is the first system that can tackle punctuation restoration for these three languages simultaneously. Traditional approaches usually treat the task as a sequence labeling task; this work instead adopts a slot-filling approach that predicts the presence and type of punctuation mark at each word boundary. The approach is similar to the Masked-Language Model approach employed during the pre-training stages of BERT, but instead of predicting the masked word, our model predicts masked punctuation. Additionally, we find that using Jieba instead of only using the built-in SentencePiece tokenizer of XLM-R can significantly improve…
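The slot-filling framing described in the abstract can be illustrated with a minimal data-preparation sketch. It is not the authors' implementation: the label set, the `<mask>` token string, the choice to place one mask after every word, and the helper names `to_slots`, `with_masks`, and `restore` are all assumptions made for illustration. The sketch only shows how punctuated text maps to words plus one label per word boundary, and how predicted labels map back to text.

```python
import re

# Hypothetical label inventory; the paper's actual punctuation set may differ.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}
LABEL_TO_PUNCT = {label: mark for mark, label in PUNCT_LABELS.items()}
NO_PUNCT = "O"  # label for an empty slot


def to_slots(text):
    """Split punctuated text into words and one slot label per word boundary."""
    words, labels = [], []
    for token in re.findall(r"[^\s,.?]+|[,.?]", text):
        if token in PUNCT_LABELS:
            if labels:  # a mark fills the slot after the preceding word
                labels[-1] = PUNCT_LABELS[token]
        else:
            words.append(token)
            labels.append(NO_PUNCT)
    return words, labels


def with_masks(words, mask="<mask>"):
    """Interleave one mask token after each word (assumed model input layout)."""
    masked = []
    for word in words:
        masked.extend([word, mask])
    return masked


def restore(words, labels):
    """Rebuild punctuated text from words and predicted slot labels."""
    return " ".join(w + LABEL_TO_PUNCT.get(l, "") for w, l in zip(words, labels))


words, labels = to_slots("hello, how are you? fine.")
print(with_masks(words))
print(restore(words, labels))
```

In a real system, a model such as XLM-R would predict the label for each masked slot; here the gold labels stand in for predictions so that the round trip can be checked.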
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Softmax · WordPiece · Linear Warmup With Linear Decay · Layer Normalization
