Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts

Jiaqing Liu; Chong Deng; Qinglin Zhang; Shilin Zhou; Qian Chen; Hai Yu; Wen Wang

arXiv:2408.09688·cs.CL·January 27, 2025

TL;DR

This paper introduces the CoS2W task to improve ASR transcript readability by converting spoken language into formal written text using large language models, supported by a new benchmark dataset and evaluation methods.

Contribution

It proposes the CoS2W task, constructs the SWAB benchmark dataset, and demonstrates how LLMs can effectively improve and evaluate spoken-to-written conversion.

Findings

1. LLMs excel in grammaticality and formality in CoS2W tasks.
2. Context and auxiliary information significantly enhance LLM performance.
3. LLM evaluators correlate strongly with human judgments.

Abstract

Automatic Speech Recognition (ASR) transcripts exhibit recognition errors and various spoken language phenomena such as disfluencies, ungrammatical sentences, and incomplete sentences, hence suffering from poor readability. To improve readability, we propose a Contextualized Spoken-to-Written conversion (CoS2W) task to address ASR and grammar errors and also transfer the informal text into the formal style with content preserved, utilizing contexts and auxiliary information. This task naturally matches the in-context learning capabilities of Large Language Models (LLMs). To facilitate comprehensive comparisons of various LLMs, we construct a document-level Spoken-to-Written conversion of ASR Transcripts Benchmark (SWAB) dataset. Using SWAB, we study the impact of different granularity levels on the CoS2W performance, and propose methods to exploit contexts and auxiliary information to…
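
As a rough illustration of how contexts and auxiliary information might be supplied to an LLM alongside the segment to be converted, the sketch below assembles a plain-text prompt. It is not the authors' method or prompt: the `Cos2wInput` class, the `build_prompt` function, the bracketed field labels, and the instruction wording are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Cos2wInput:
    """Hypothetical container for one conversion request (not from the paper)."""
    target: str                                              # ASR segment to rewrite
    preceding: list[str] = field(default_factory=list)       # earlier segments, used as context
    auxiliary: dict[str, str] = field(default_factory=dict)  # e.g. a title or keywords


def build_prompt(item: Cos2wInput, max_context: int = 3) -> str:
    """Assemble a plain-text prompt; the wording is illustrative only."""
    lines = [
        "Rewrite the target ASR segment as formal written text.",
        "Fix recognition and grammar errors, remove disfluencies, "
        "and preserve the content.",
    ]
    # Auxiliary information comes first so it frames everything after it.
    for key, value in item.auxiliary.items():
        lines.append(f"[{key}] {value}")
    # Keep only the most recent segments as context (none if max_context <= 0).
    context = item.preceding[-max_context:] if max_context > 0 else []
    for segment in context:
        lines.append(f"[context] {segment}")
    # The segment to convert goes last.
    lines.append(f"[target] {item.target}")
    return "\n".join(lines)


if __name__ == "__main__":
    example = Cos2wInput(
        target="um so the the model were trained on on meeting data",
        preceding=["today we talk about speech recognition"],
        auxiliary={"title": "Team meeting"},
    )
    print(build_prompt(example))
```

Limiting the context to the last few segments is one simple way to vary how much surrounding text accompanies each target; the resulting string would then be sent to whichever LLM is being compared.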

Taxonomy

Topics: Interpreting and Communication in Healthcare · Speech Recognition and Synthesis