FormalASR: End-to-End Spoken Chinese to Formal Text
Wanyi Ning, Yinshang Guo, Haitao Qian, Jiyuan Cheng, Weiyuan Feng, Yufei Zhang

TL;DR
FormalASR introduces end-to-end models that directly convert spoken Chinese into formal written text, reducing errors and improving quality without needing post-processing LLMs.
Contribution
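The paper presents two compact end-to-end spoken-to-formal transcription models (0.6B and 1.7B parameters, fine-tuned from Qwen3-ASR) and two large-scale datasets, WenetSpeech-Formal and Speechio-Formal, aimed at on-device Chinese speech recognition that outputs formal written text.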
Findings
Achieves up to a 37.4% relative CER reduction over baselines
Improves ROUGE-L and BERTScore metrics
Requires no post-processing LLM at deployment
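
For readers unfamiliar with the headline metric, the following is a minimal sketch of how a character error rate (CER) and a relative CER reduction are conventionally computed. It is not the paper's evaluation code: the function names, the example sentences, and the 0.20 and 0.15 error rates are illustrative assumptions, and the paper's text normalization (punctuation, numerals) is not specified here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """CER = character edits / number of reference characters."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, system_cer: float) -> float:
    """Fraction of the baseline's error removed by the system."""
    return (baseline_cer - system_cer) / baseline_cer


# Hypothetical example: the formal reference drops the filler word,
# while a verbatim transcript keeps it and is charged one insertion.
formal_reference = "我们明天开会"
verbatim_output = "嗯我们明天开会"
example_cer = cer(formal_reference, verbatim_output)

# Hypothetical rates, not results from the paper.
example_reduction = relative_reduction(0.20, 0.15)
```

Under this definition, a relative reduction expresses the gain as a fraction of the baseline's error rate, not as a difference in percentage points.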
Abstract
Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) with supervised fine-tuning. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative…
