Spontaneous Informal Speech Dataset for Punctuation Restoration

Xing Yi Liu; Homayoon Beigi

arXiv:2409.11241·cs.CL·September 18, 2024

TL;DR

This paper introduces SponSpeech, a punctuation restoration dataset built from spontaneous, informal speech. It addresses the gap left by evaluating existing models almost solely on scripted corpora, and it comes with a filtering pipeline for generating more data and a challenging test set for evaluation.

Contribution

The paper presents SponSpeech, a spontaneous speech dataset with punctuation, and a filtering pipeline for data quality, enabling better training and evaluation of punctuation restoration models.

Findings

01. SponSpeech includes spontaneous speech with punctuation and casing.

02. A filtering pipeline that checks both audio and transcription quality improves data quality for training.

03. A challenging test set evaluates models' use of audio cues.

Abstract

Presently, punctuation restoration models are evaluated almost solely on well-structured, scripted corpora. Real-world ASR systems and post-processing pipelines, on the other hand, are typically applied to spontaneous speech with significant irregularities, stutters, and deviations from perfect grammar. To address this discrepancy, we introduce SponSpeech, a punctuation restoration dataset derived from informal speech sources, which includes punctuation and casing information. In addition to publicly releasing the dataset, we contribute a filtering pipeline that can be used to generate more data. Our filtering pipeline examines the quality of both speech audio and transcription text. We also carefully construct a "challenging" test set, aimed at evaluating models' ability to leverage audio information to predict otherwise grammatically ambiguous punctuation. SponSpeech is available at…
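
To make the task format concrete, the following is a minimal sketch of how a punctuated, cased transcript can be turned into a punctuation restoration training example: lowercase, unpunctuated tokens as input, with one punctuation label and one casing label per token. It is an illustration only and not the SponSpeech pipeline. The function name `make_example`, the label names, and the three-mark label set are assumptions made for this example.

```python
import string

# Hypothetical label set, chosen for illustration only;
# the dataset's actual label inventory is not given in this summary.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def make_example(transcript: str):
    """Split a punctuated transcript into (tokens, punct_labels, case_labels).

    Each token is lowercased and stripped of punctuation. Its punctuation
    label is the mark that followed it in the original text (or "NONE"),
    and its casing label records whether it began with an uppercase letter.
    """
    tokens, punct_labels, case_labels = [], [], []
    for raw in transcript.split():
        word, label = raw, "NONE"
        # A trailing mark from the label set becomes this word's label.
        if word[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[word[-1]]
            word = word[:-1]
        # Drop any other leading or trailing punctuation (quotes, dashes).
        word = word.strip(string.punctuation)
        if not word:
            # Standalone punctuation carries no token; skip it.
            continue
        case_labels.append("UPPER" if word[0].isupper() else "LOWER")
        tokens.append(word.lower())
        punct_labels.append(label)
    return tokens, punct_labels, case_labels


tokens, punct_labels, case_labels = make_example("Well, I I think so. Do you?")
```

The repeated "I I" in the usage line stands in for the stutters the abstract mentions. A model trained on such pairs sees only `tokens` (plus audio, where available) and must predict the two label sequences.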

Code & Models

Repositories

githubaccountanonymous/pr (PyTorch, Official)
