BASS: Block-wise Adaptation for Speech Summarization
Roshan Sharma, Kenneth Zheng, Siddhant Arora, Shinji Watanabe, Rita Singh, Bhiksha Raj

TL;DR
This paper introduces BASS, a block-wise training method for speech summarization that lets models handle very long inputs, improving performance over truncated-input baselines.
Contribution
The paper presents a novel block-wise training approach for speech summarization, allowing incremental learning on long sequences and passing semantic context across blocks.
Findings
Achieved a 3-point absolute improvement in ROUGE-L over the truncated-input baseline on the How2 dataset.
Demonstrated effective training on very long sequences.
Enabled streaming-based speech summarization.
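For readers unfamiliar with the metric behind the first finding, ROUGE-L scores a hypothesis summary by the longest common subsequence (LCS) of tokens it shares with a reference. The sketch below is a minimal illustration, assuming whitespace tokenization and the balanced F1 variant; the function names are hypothetical, and the paper's actual scoring toolkit and settings are not stated in this summary.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hypothesis: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall.

    Assumption: whitespace tokenization and equal weighting of precision
    and recall; published toolkits differ on both.
    """
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    if not hyp_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(hyp_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Scores are usually reported on a 0 to 100 scale, so a 3-point absolute improvement means the score itself rises by 3 (for example, from 50 to 53), not by 3 percent of the baseline.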
Abstract
End-to-end speech summarization has been shown to improve performance over cascade baselines. However, such models are difficult to train on very large inputs (dozens of minutes or hours) owing to compute restrictions and are hence trained with truncated model inputs. Truncation leads to poorer models, and a solution to this problem rests in block-wise modeling, i.e., processing a portion of the input frames at a time. In this paper, we develop a method that allows one to train summarization models on very long sequences in an incremental manner. Speech summarization is realized as a streaming process, where hypothesis summaries are updated every block based on new acoustic information. We devise and test strategies to pass semantic context across the blocks. Experiments on the How2 dataset demonstrate that the proposed block-wise training method improves by 3 points absolute on ROUGE-L…
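The streaming loop described in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's model: frames are plain floats, `encode_block` is a hypothetical stand-in for an encoder, and the carried context is a single number mixed in by simple averaging. The context-passing strategies the paper actually tests are not detailed in this summary.

```python
from typing import Iterator, List, Sequence


def split_into_blocks(frames: Sequence[float], block_size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive blocks of at most block_size frames."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, len(frames), block_size):
        yield frames[start:start + block_size]


def encode_block(block: Sequence[float], context: float) -> float:
    """Hypothetical stand-in for a block encoder.

    Mixes the mean of the current block with the context carried over
    from earlier blocks. A real model would use learned representations.
    """
    block_mean = sum(block) / len(block)
    return 0.5 * block_mean + 0.5 * context


def streaming_summarize(frames: Sequence[float], block_size: int) -> List[float]:
    """Process the input one block at a time, updating the hypothesis per block."""
    context = 0.0  # nothing has been seen before the first block
    hypotheses: List[float] = []
    for block in split_into_blocks(frames, block_size):
        context = encode_block(block, context)  # carry context forward
        hypotheses.append(context)  # hypothesis after this block
    return hypotheses
```

Only one block is held at a time, which is what keeps memory bounded for long inputs; the list of per-block hypotheses mirrors the idea that the summary is refreshed as new audio arrives.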
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
