TL;DR
This paper introduces SponSpeech, a dataset of spontaneous speech with punctuation and casing. It addresses the gap left by punctuation restoration models that are evaluated almost solely on scripted data, and it provides a filtering pipeline for generating more data along with a challenging test set for evaluation.
Contribution
The paper presents SponSpeech, a spontaneous speech dataset with punctuation, and a filtering pipeline for data quality, enabling better training and evaluation of punctuation restoration models.
Findings
SponSpeech includes spontaneous speech with punctuation and casing.
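A filtering pipeline checks the quality of both the speech audio and the transcription text, and can be used to generate more data.
A challenging test set evaluates whether models can use audio cues to predict punctuation that is grammatically ambiguous from the text alone.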
Abstract
Presently, punctuation restoration models are evaluated almost solely on well-structured, scripted corpora. Real-world ASR systems and post-processing pipelines, on the other hand, are typically applied to spontaneous speech with significant irregularities, stutters, and deviations from perfect grammar. To address this discrepancy, we introduce SponSpeech, a punctuation restoration dataset derived from informal speech sources, which includes punctuation and casing information. In addition to publicly releasing the dataset, we contribute a filtering pipeline that can be used to generate more data. Our filtering pipeline examines the quality of both speech audio and transcription text. We also carefully construct a "challenging" test set, aimed at evaluating models' ability to leverage audio information to predict otherwise grammatically ambiguous punctuation. SponSpeech is available at…
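To make the task concrete, here is a minimal sketch of how a punctuated, cased transcript might be turned into a training example for punctuation restoration. This is not the authors' code. The function name `make_example`, the label names, and the choice of three punctuation classes are assumptions made for illustration, since the abstract does not state the label scheme.

```python
import string

# Hypothetical label set; the paper's actual classes are not given in the abstract.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def make_example(transcript: str):
    """Split a punctuated, cased transcript into model input and per-word labels.

    Returns (tokens, punct_labels, case_labels), where tokens are lowercased
    words with surrounding punctuation removed.
    """
    tokens, punct_labels, case_labels = [], [], []
    for raw in transcript.split():
        # Label the word by its trailing mark, if it is one of the modeled classes.
        punct = PUNCT_LABELS.get(raw[-1], "NONE")
        # Strip leading/trailing punctuation; inner apostrophes are kept.
        word = raw.strip(string.punctuation)
        if not word:
            continue  # the token consisted of punctuation only
        tokens.append(word.lower())
        punct_labels.append(punct)
        case_labels.append("CAP" if word[0].isupper() else "LOWER")
    return tokens, punct_labels, case_labels


statement = make_example("So, you're coming.")
question = make_example("So, you're coming?")
```

Here `statement` and `question` share the same input tokens but differ in the label of the final word. This is the kind of case where text alone cannot decide the punctuation, and a model would need audio cues such as intonation.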
