ESSumm: Extractive Speech Summarization from Untranscribed Meeting

Jun Wang

arXiv:2209.06913·eess.AS·September 16, 2022

TL;DR

This paper introduces ESSumm, an unsupervised direct speech summarization model that generates summaries from raw audio without transcription, leveraging deep speech features and confidence scoring.

Contribution

The novel ESSumm architecture enables extractive speech summarization directly from untranscribed audio, outperforming some transcript-based methods.

Findings

01 Effective on the AMI and ICSI meeting datasets

02 Performs comparably to transcript-based approaches

03 Uses a self-supervised CNN for feature extraction

Abstract

In this paper, we propose ESSumm, a novel architecture for direct extractive speech-to-speech summarization: an unsupervised model that does not depend on intermediate transcribed text. Unlike previous methods that rely on a text representation, we aim to generate a summary directly from speech without transcription. First, a set of smaller speech segments is extracted based on the speech signal's acoustic features. For each candidate speech segment, a distance-based summarization confidence score is designed to measure its latent speech representation. Specifically, we leverage an off-the-shelf self-supervised convolutional neural network to extract deep speech features from raw audio. Our approach automatically predicts the optimal sequence of speech segments that captures the key information within a target summary length. Extensive results on two well-known meeting datasets…
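The pipeline in the abstract (segment the audio, embed each segment, score it by a distance measure, pick segments up to a target length) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not define the confidence score, so the sketch assumes a simple stand-in (negative Euclidean distance from a segment's embedding to the centroid of all segment embeddings) and a greedy selection rule. The function names `confidence_scores` and `select_segments` are hypothetical, and the embeddings are assumed to have been computed already by some self-supervised speech model.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def confidence_scores(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """ASSUMED score: negative Euclidean distance to the centroid.

    Segments closer to the centroid are treated as more representative.
    The paper's actual distance-based score may differ.
    """
    c = centroid(embeddings)
    return [-math.dist(e, c) for e in embeddings]


def select_segments(
    embeddings: Sequence[Sequence[float]],
    durations: Sequence[float],
    target_length: float,
) -> List[int]:
    """Greedily pick the highest-scoring segments that fit the budget.

    Returns segment indices in chronological order so the selected
    audio can be concatenated into a spoken summary.
    """
    scores = confidence_scores(embeddings)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    total = 0.0
    for i in ranked:
        if total + durations[i] <= target_length:
            chosen.append(i)
            total += durations[i]
    return sorted(chosen)


# Toy example: four 2-second segments with 2-D embeddings, 4-second budget.
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
toy_durations = [2.0, 2.0, 2.0, 2.0]
summary_indices = select_segments(toy_embeddings, toy_durations, 4.0)
```

The greedy rule is only one way to respect a target summary length; the abstract speaks of predicting an optimal sequence of segments, which may involve a different selection procedure.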

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis