Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts
Jiaqing Liu, Chong Deng, Qinglin Zhang, Shilin Zhou, Qian Chen, Hai Yu, Wen Wang

TL;DR
This paper introduces the CoS2W task to improve ASR transcript readability by converting spoken language into formal written text using large language models, supported by a new benchmark dataset and evaluation methods.
Contribution
It proposes the CoS2W task, constructs the SWAB benchmark dataset, and demonstrates how LLMs can effectively improve and evaluate spoken-to-written conversion.
Findings
LLMs excel in grammaticality and formality in CoS2W tasks.
Context and auxiliary information significantly enhance LLM performance.
LLM evaluators correlate strongly with human judgments.
Abstract
Automatic Speech Recognition (ASR) transcripts exhibit recognition errors and various spoken language phenomena such as disfluencies, ungrammatical sentences, and incomplete sentences, hence suffering from poor readability. To improve readability, we propose a Contextualized Spoken-to-Written conversion (CoS2W) task to address ASR and grammar errors and also transfer the informal text into the formal style with content preserved, utilizing contexts and auxiliary information. This task naturally matches the in-context learning capabilities of Large Language Models (LLMs). To facilitate comprehensive comparisons of various LLMs, we construct a document-level Spoken-to-Written conversion of ASR Transcripts Benchmark (SWAB) dataset. Using SWAB, we study the impact of different granularity levels on the CoS2W performance, and propose methods to exploit contexts and auxiliary information to…
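The abstract describes converting each piece of an ASR transcript while drawing on surrounding context and auxiliary information. The sketch below shows one way such an input could be assembled for an LLM. It is an illustration only: the function name `build_cos2w_prompt`, the `context_window` parameter, the section labels, the instruction wording, and the toy transcript are all assumptions made here, not the prompts or data used in the paper.

```python
from typing import Optional, Sequence


def build_cos2w_prompt(
    segments: Sequence[str],
    index: int,
    context_window: int = 1,
    auxiliary: Optional[str] = None,
) -> str:
    """Assemble a plain-text prompt for one target segment (hypothetical format)."""
    if not 0 <= index < len(segments):
        raise IndexError("index is outside the transcript")
    if context_window < 0:
        raise ValueError("context_window must be non-negative")

    # Neighbouring segments serve as context; the window is clamped at document edges.
    start = max(0, index - context_window)
    before = segments[start:index]
    after = segments[index + 1 : index + 1 + context_window]

    # Instruction wording is an assumption that paraphrases the task description.
    parts = [
        "Rewrite the target ASR segment as formal written text. "
        "Fix recognition and grammar errors, remove disfluencies, "
        "and preserve the original content."
    ]
    if auxiliary:
        parts.append("Auxiliary information:\n" + auxiliary)
    if before:
        parts.append("Preceding context:\n" + "\n".join(before))
    parts.append("Target segment:\n" + segments[index])
    if after:
        parts.append("Following context:\n" + "\n".join(after))
    return "\n\n".join(parts)


# Invented toy transcript, for illustration only.
transcript = [
    "so um today we gonna talk about the the budget",
    "uh the number is like higher then last year",
    "and yeah that's that's basically it",
]
prompt = build_cos2w_prompt(
    transcript, index=1, auxiliary="Topic: annual budget review"
)
```

Changing `context_window` varies how much of the document accompanies each target segment, which is one simple way to experiment with different granularity levels; the resulting string would then be sent to whichever LLM is being compared.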
Taxonomy
Topics: Interpreting and Communication in Healthcare · Speech Recognition and Synthesis
