Speech Summarization using Restricted Self-Attention

Roshan Sharma; Shruti Palaskar; Alan W Black; Florian Metze

arXiv:2110.06263·cs.CL·January 26, 2022

TL;DR

This paper introduces an end-to-end speech summarization model that uses restricted self-attention to process long audio sequences efficiently. It outperforms a cascaded model on ROUGE and on concept-prediction F-1.

Contribution

The work applies restricted self-attention to speech models for the first time, allowing direct speech summarization without cascaded components.

Findings

01 The end-to-end model outperforms the previously proposed cascaded model by 3 absolute ROUGE points on the How-2 corpus.

02 The model scores 4 F-1 points higher on concept prediction.

03 Restricted self-attention makes long-sequence speech processing feasible within memory and compute limits.

Abstract

Speech summarization is typically performed by using a cascade of speech recognition and text summarization models. End-to-end modeling of speech summarization is challenging due to memory and compute constraints arising from long input audio sequences. Recent work in document summarization has inspired methods to reduce the complexity of self-attention, which enables transformer models to handle long sequences. In this work, we introduce a single model optimized end-to-end for speech summarization. We apply the restricted self-attention technique from text-based models to speech models to address the memory and compute constraints. We demonstrate that the proposed model learns to directly summarize speech for the How-2 corpus of instructional videos. The proposed end-to-end model outperforms the previously proposed cascaded model by 3 points absolute on ROUGE. Further, we…
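The restricted self-attention idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a symmetric fixed-width window, a single head, and no learned query/key/value projections, and the names `restricted_self_attention` and `window` are hypothetical. It only shows how limiting each frame to nearby frames avoids building the full attention matrix.

```python
import numpy as np

def restricted_self_attention(x, window):
    """Single-head self-attention in which frame i attends only to
    frames i-window .. i+window (clipped at the sequence edges).

    Assumption for illustration: queries, keys, and values are the raw
    input frames, so no learned projections are involved.
    """
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    out = np.empty_like(x)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        keys = x[lo:hi]                      # at most 2*window + 1 frames
        scores = keys @ x[i] / np.sqrt(d)    # scaled dot-product scores
        scores -= scores.max()               # stabilise the softmax
        weights = np.exp(scores)
        weights /= weights.sum()
        out[i] = weights @ keys              # weighted sum of local frames
    return out

# Hypothetical input: 1000 acoustic frames with 80 features each.
frames = np.random.default_rng(0).normal(size=(1000, 80))
context = restricted_self_attention(frames, window=20)
```

With n frames and window w, this computes at most n(2w + 1) attention scores rather than n², which is the kind of saving that makes long audio inputs tractable. A practical model would batch the computation and add learned projections and multiple heads.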

Code & Models

Models

espnet/roshansh_how2_asr_raw_ft_sum_valid.acc (Hugging Face model · 3 downloads)

Taxonomy

Topics: Natural Language Processing Techniques · Music and Audio Processing · Topic Modeling