Align then Summarize: Automatic Alignment Methods for Summarization Corpus Creation
Paul Tardy, David Janiszek, Yannick Estève, Vincent Nguyen

TL;DR
This paper introduces an automatic alignment method for meeting transcriptions to create a dataset for neural summarization, demonstrating significant improvements in summarization quality through better data annotation.
Contribution
It proposes a bootstrapping approach for automatic transcription alignment, reducing annotation effort and enhancing dataset size for training summarization models.
Findings
Automatic alignment improves ROUGE scores by nearly 4 points.
The method reduces human annotation effort significantly.
The created corpus enables better generalization for summarization models.
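To make the size of a ROUGE gain concrete, here is a minimal sketch of ROUGE-1 F1 (unigram overlap) in Python. It is a simplified illustration, not the evaluation code used in the paper: whitespace tokenization, lowercasing, and the function name `rouge1_f1` are all assumptions, and published scores normally come from a standard ROUGE toolkit with its own preprocessing.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between two texts.

    Assumption: lowercased whitespace tokens, no stemming or stopword removal.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each unigram counts at most as often as it
    # appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    score = rouge1_f1("the budget was approved", "the board approved the budget")
    print(f"ROUGE-1 F1: {score:.3f}")
```

ROUGE is usually reported on a 0 to 100 scale, so a gain of 4 points corresponds to 0.04 on the 0 to 1 scale this function returns.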
Abstract
Summarizing texts is not a straightforward task. Before even considering text summarization, one should determine what kind of summary is expected. How much should the information be compressed? Is it relevant to reformulate or should the summary stick to the original phrasing? State-of-the-art on automatic text summarization mostly revolves around news articles. We suggest that considering a wider variety of tasks would lead to an improvement in the field, in terms of generalization and robustness. We explore meeting summarization: generating reports from automatic transcriptions. Our work consists in segmenting and aligning transcriptions with respect to reports, to get a suitable dataset for neural summarization. Using a bootstrapping approach, we provide pre-alignments that are corrected by human annotators, making a validation set against which we evaluate automatic models. This…
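As an illustration of what a pre-alignment step could look like, the sketch below pairs each report segment with the transcript segment that shares the most vocabulary with it. This is a deliberately naive baseline under stated assumptions, not the alignment method described in the paper: Jaccard similarity over lowercased whitespace tokens and the names `jaccard` and `align` are all hypothetical choices made for this example.

```python
def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens as a set (assumed tokenization)."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two segments."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def align(report_segments: list[str], transcript_segments: list[str]):
    """For each report segment, pick the most similar transcript segment.

    Returns (report_index, transcript_index, score) triples.
    Assumes transcript_segments is non-empty.
    """
    pairs = []
    for i, rep in enumerate(report_segments):
        scores = [jaccard(rep, tr) for tr in transcript_segments]
        best = max(range(len(scores)), key=scores.__getitem__)
        pairs.append((i, best, scores[best]))
    return pairs


if __name__ == "__main__":
    report = ["budget approved for next year", "new hire starts monday"]
    transcript = [
        "so the new hire starts on monday right",
        "we approved the budget for next year",
    ]
    for rep_idx, tr_idx, score in align(report, transcript):
        print(rep_idx, tr_idx, round(score, 2))
```

Pairs produced this way would serve only as suggestions for annotators to correct, in the spirit of the bootstrapping loop the abstract describes.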
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
