ESSumm: Extractive Speech Summarization from Untranscribed Meeting
Jun Wang

TL;DR
This paper introduces ESSumm, an unsupervised direct speech summarization model that generates summaries from raw audio without transcription, leveraging deep speech features and confidence scoring.
Contribution
The novel ESSumm architecture enables extractive speech summarization directly from untranscribed audio, outperforming some transcript-based methods.
Findings
Effective on AMI and ICSI datasets
Performs comparably to transcript-based approaches
Utilizes self-supervised CNN for feature extraction
Abstract
In this paper, we propose ESSumm, a novel architecture for direct extractive speech-to-speech summarization. It is an unsupervised model that does not depend on intermediate transcribed text. Unlike previous text-based methods, we aim to generate a summary directly from speech without transcription. First, a set of smaller speech segments is extracted based on the acoustic features of the speech signal. For each candidate speech segment, a distance-based summarization confidence score is computed over its latent speech representation. Specifically, we leverage an off-the-shelf self-supervised convolutional neural network to extract deep speech features from raw audio. Our approach automatically predicts the optimal sequence of speech segments that captures the key information within a target summary length. Extensive results on two well-known meeting datasets…
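The pipeline described in the abstract (segment the audio, embed each segment, score it by a distance measure, then pick segments up to a target length) can be illustrated with a minimal sketch. This is not the paper's implementation. The abstract does not specify the score or the selection procedure, so the sketch assumes a negative Euclidean distance to the centroid of all segment embeddings as the confidence score and a greedy fill of the length budget. The function names, the toy two-dimensional embeddings, and the durations are all hypothetical; in practice the embeddings would come from a self-supervised speech model.

```python
import math


def centroid(vectors):
    """Mean vector of a list of equal-length embeddings."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def confidence_scores(embeddings):
    # ASSUMPTION: a segment close to the centroid of all segment
    # embeddings is treated as more representative, so its score
    # is the negated distance (higher is better).
    c = centroid(embeddings)
    return [-euclidean(e, c) for e in embeddings]


def select_segments(embeddings, durations, target_length):
    # ASSUMPTION: greedy selection by descending score, skipping any
    # segment that would exceed the target summary length (seconds).
    scores = confidence_scores(embeddings)
    ranked = sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)
    chosen, total = [], 0.0
    for i in ranked:
        if total + durations[i] <= target_length:
            chosen.append(i)
            total += durations[i]
    # Return indices in chronological order so the summary plays back in sequence.
    return sorted(chosen)


# Toy stand-ins for per-segment deep speech features and segment durations.
embeddings = [[0.0, 0.0], [1.0, 1.0], [1.2, 0.8], [5.0, 5.0]]
durations = [4.0, 3.0, 5.0, 2.0]
summary = select_segments(embeddings, durations, target_length=8.0)
print(summary)  # indices of the selected segments
```

Returning indices rather than audio keeps the sketch independent of any audio library; the selected segments would then be concatenated to form the spoken summary.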
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
