Generating Human Readable Transcript for Automatic Speech Recognition with Pre-trained Language Model

Junwei Liao; Yu Shi; Ming Gong; Linjun Shou; Sefik Eskimez; Liyang Lu; Hong Qu; Michael Zeng

arXiv:2102.11114·cs.CL·February 23, 2021

Open Access

TL;DR

This paper introduces an ASR post-processing model, built on a fine-tuned RoBERTa, that improves the readability of ASR transcripts, reducing errors and making the output more usable for both human readers and downstream tasks.

Contribution

The paper presents a novel data augmentation method and a two-stage training strategy for fine-tuning a pre-trained language model to produce more human-readable ASR transcripts.

Findings

01. Outperforms the baseline by 13.26 on RA-WER

02. Achieves a BLEU score 17.53 higher than the baseline

03. Human evaluation shows improved readability
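
The first finding is reported in RA-WER, a word error rate variant. As background only, the sketch below computes plain word error rate (WER): the word-level edit distance between a reference and a hypothesis, divided by the reference length. It is a minimal illustration, not the paper's metric. How RA-WER differs from plain WER is not described in this summary, and the function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Plain WER: word-level edit distance divided by reference length.

    Illustrative only; this is NOT the paper's RA-WER, whose definition
    is not given in this summary.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a disfluent transcript scored against a clean reference.
# The two extra words ("uh" and a repeated "i") count as insertions.
print(word_error_rate("i want to go home", "uh i i want to go home"))  # 0.4
```

Because WER counts errors, lower values are better, whereas for BLEU higher values are better.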

Abstract

Modern Automatic Speech Recognition (ASR) systems can achieve high performance in terms of recognition accuracy. However, even a perfectly accurate transcript can still be challenging to read due to disfluency, filler words, and other errata common in spoken communication. Many downstream tasks and human readers rely on the output of the ASR system; therefore, errors introduced by the speaker and the ASR system alike will be propagated to the next task in the pipeline. In this work, we propose an ASR post-processing model that aims to transform the incorrect and noisy ASR output into readable text for humans and downstream tasks. We leverage the Metadata Extraction (MDE) corpus to construct a task-specific dataset for our study. Since the dataset is small, we propose a novel data augmentation method and use a two-stage training strategy to fine-tune the RoBERTa pre-trained model. On the…

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling

Methods: Linear Layer · Linear Warmup With Linear Decay · Softmax · Adam · Multi-Head Attention · Residual Connection · Dropout · WordPiece · Attention Dropout