A Modular Approach for Multimodal Summarization of TV Shows
Louis Mahon, Mirella Lapata

TL;DR
This paper introduces a modular multimodal approach to TV show summarization that combines scene detection, scene reordering, visual-to-text conversion, and dialogue summarization to produce higher-quality summaries, which it evaluates with a new fact-based metric.
Contribution
It proposes a flexible modular framework for TV show summarization and introduces PRISMA, a new metric for fact-based summary evaluation.
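To make the idea of a modular framework concrete, here is a minimal sketch in which each of the five stages named in the abstract (scene detection, scene reordering, visual-to-text conversion, per-scene summarization, fusion) is a swappable function. Every function name and all of the toy stage implementations are hypothetical placeholders chosen for illustration; they are not the authors' components.

```python
from typing import Callable, List

Scene = List[str]  # hypothetical representation: a scene is a list of text lines


def summarize_episode(
    lines: List[str],
    split_scenes: Callable[[List[str]], List[Scene]],
    reorder_scenes: Callable[[List[Scene]], List[Scene]],
    add_visual_text: Callable[[Scene], Scene],
    summarize_scene: Callable[[Scene], str],
    fuse: Callable[[List[str]], str],
) -> str:
    """Run the stages in sequence; any stage can be replaced independently."""
    scenes = reorder_scenes(split_scenes(lines))
    scenes = [add_visual_text(scene) for scene in scenes]
    return fuse([summarize_scene(scene) for scene in scenes])


# Toy stand-ins (assumptions for illustration only).
def split_on_blank(lines: List[str]) -> List[Scene]:
    # Treat a blank line as a scene boundary.
    scenes, current = [], []
    for line in lines:
        if line.strip():
            current.append(line)
        elif current:
            scenes.append(current)
            current = []
    if current:
        scenes.append(current)
    return scenes


def keep_order(scenes: List[Scene]) -> List[Scene]:
    return scenes  # placeholder: no reordering


def no_visuals(scene: Scene) -> Scene:
    return scene  # placeholder: no visual descriptions added


def first_line(scene: Scene) -> str:
    return scene[0]  # placeholder "summary": the first line of the scene


def join_summaries(summaries: List[str]) -> str:
    return " ".join(summaries)


episode = ["A: Hi.", "B: Hello.", "", "C: Bye."]
summary = summarize_episode(
    episode, split_on_blank, keep_order, no_visuals, first_line, join_summaries
)
```

Swapping a single stage, for example passing a different `summarize_scene` function, leaves the rest of the pipeline untouched, which is the kind of flexibility a modular design is meant to provide.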
Findings
Outperforms comparison models on ROUGE scores
Achieves higher-quality summaries according to human evaluation
Introduces PRISMA metric for factual summary assessment
Abstract
In this paper we address the task of summarizing television shows, which touches key areas in AI research: complex reasoning, multiple modalities, and long narratives. We present a modular approach where separate components perform specialized sub-tasks which we argue affords greater flexibility compared to end-to-end methods. Our modules involve detecting scene boundaries, reordering scenes so as to minimize the number of cuts between different events, converting visual information to text, summarizing the dialogue in each scene, and fusing the scene summaries into a final summary for the entire episode. We also present a new metric, PRISMA (Precision and Recall EvaluatIon of Summary FActs), to measure both precision and recall of generated summaries, which we decompose into atomic facts. Tested on the recently released SummScreen3D dataset, our method produces higher quality summaries…
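As a rough illustration of precision and recall computed over atomic facts, the sketch below scores a generated summary against a reference once both have been decomposed into fact lists. The abstract does not say how facts are extracted or how support is judged, so the `is_supported` callback, the `exact_match` judge, and the example facts are all assumptions made for this sketch, not the PRISMA implementation.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge type: is `fact` supported by the list of evidence facts?
SupportJudge = Callable[[str, Sequence[str]], bool]


def fact_precision_recall(
    generated_facts: Sequence[str],
    reference_facts: Sequence[str],
    is_supported: SupportJudge,
) -> Tuple[float, float]:
    """Return (precision, recall) computed over atomic facts."""

    def fraction_supported(facts: Sequence[str], evidence: Sequence[str]) -> float:
        if not facts:
            return 0.0  # convention chosen for this sketch when there are no facts
        return sum(is_supported(fact, evidence) for fact in facts) / len(facts)

    # Precision: share of generated facts backed by the reference.
    precision = fraction_supported(generated_facts, reference_facts)
    # Recall: share of reference facts recovered by the generated summary.
    recall = fraction_supported(reference_facts, generated_facts)
    return precision, recall


def exact_match(fact: str, evidence: Sequence[str]) -> bool:
    # Naive stand-in judge: case-insensitive exact string match.
    return fact.strip().lower() in {e.strip().lower() for e in evidence}


generated = ["Anna leaves town", "Ben finds the letter", "Carl is arrested"]
reference = ["Anna leaves town", "Ben finds the letter", "Dana confesses", "Eve returns"]
precision, recall = fact_precision_recall(generated, reference, exact_match)
```

In this toy example two of the three generated facts appear in the reference and two of the four reference facts appear in the generated list. A practical judge would need to recognise paraphrases rather than rely on exact string matching.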
Taxonomy
Topics: Advanced Text Analysis Techniques · Natural Language Processing Techniques
