Prosodic segmentation for parsing spoken dialogue
Elizabeth Nielsen, Mark Steedman, Sharon Goldwater

TL;DR
This paper explores how prosody can improve parsing of spoken dialogue. With prosodic features, a turn-based model can segment sentence-like units (SUs) without pre-segmented input and match the performance of a model given gold-standard segmentation.
Contribution
It demonstrates that prosodic features can replace gold-standard segmentation in turn-based dialogue parsing, enabling more realistic speech processing applications.
Findings
Prosody enables turn-based models to match the parse performance of SU-based models, which receive gold-standard sentence-like unit (SU) boundaries.
Pitch and intensity are key features for boundary detection.
Prosody helps distinguish between SU boundaries and disfluencies.
Abstract
Parsing spoken dialogue poses unique difficulties, including disfluencies and unmarked boundaries between sentence-like units. Previous work has shown that prosody can help with parsing disfluent speech (Tran et al. 2018), but has assumed that the input to the parser is already segmented into sentence-like units (SUs), which isn't true in existing speech applications. We investigate how prosody affects a parser that receives an entire dialogue turn as input (a turn-based model), instead of gold standard pre-segmented SUs (an SU-based model). In experiments on the English Switchboard corpus, we find that when using transcripts alone, the turn-based model has trouble segmenting SUs, leading to worse parse performance than the SU-based model. However, prosody can effectively replace gold standard SU boundaries: with prosody, the turn-based model performs as well as the SU-based model…
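The difference between the two input settings described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy turn, the function names, and the boundary F1 score are hypothetical and are not taken from the paper, which reports parse performance rather than this particular metric.

```python
from typing import List, Set

# Hypothetical toy data: one dialogue turn made of two sentence-like units (SUs).
TURN_SUS: List[List[str]] = [
    ["yeah", "i", "think", "so"],
    ["we", "uh", "we", "went", "there", "last", "year"],
]


def su_based_inputs(sus: List[List[str]]) -> List[List[str]]:
    """SU-based setting: the parser receives each gold SU as a separate input."""
    return [list(su) for su in sus]


def turn_based_input(sus: List[List[str]]) -> List[str]:
    """Turn-based setting: the parser receives the whole turn as one token sequence."""
    return [tok for su in sus for tok in su]


def gold_boundaries(sus: List[List[str]]) -> Set[int]:
    """Token offsets at which one SU ends and the next begins (turn end excluded)."""
    boundaries, pos = set(), 0
    for su in sus[:-1]:
        pos += len(su)
        boundaries.add(pos)
    return boundaries


def boundary_f1(predicted: Set[int], gold: Set[int]) -> float:
    """F1 over predicted SU boundary offsets (an assumed metric, for illustration)."""
    if not predicted and not gold:
        return 1.0  # nothing to find and nothing predicted
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(su_based_inputs(TURN_SUS))   # two separate parser inputs
    print(turn_based_input(TURN_SUS))  # one parser input, boundaries unmarked
    print(gold_boundaries(TURN_SUS))   # boundary the turn-based model must recover
```

In the turn-based setting the model has to recover the internal boundary itself; a spurious boundary predicted at a disfluency such as "we uh we" would lower precision in this sketch.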
