A Modular Approach for Multimodal Summarization of TV Shows

Louis Mahon; Mirella Lapata

arXiv:2403.03823 · cs.CL · August 23, 2024 · 1 citation

TL;DR

This paper introduces a modular multimodal approach to TV show summarization. It combines scene detection, scene reordering, visual-to-text conversion, per-scene dialogue summarization, and fusion into an episode-level summary, and it evaluates the resulting summaries with a fact-based metric.
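The module sequence named above can be sketched as a chain of interchangeable callables. This is a minimal illustration of the modular idea only, not the authors' implementation: the `Scene` and `Pipeline` names, the field names, and the toy stand-in functions are all assumptions made for this example, and the toy input is a plain-text transcript rather than video.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scene:
    dialogue: str
    visual_text: str = ""  # text produced by the visual-to-text module


@dataclass
class Pipeline:
    """Hypothetical container: one swappable callable per sub-task."""
    detect_scenes: Callable[[str], List[Scene]]
    reorder: Callable[[List[Scene]], List[Scene]]
    describe_visuals: Callable[[Scene], str]
    summarize_scene: Callable[[Scene], str]
    fuse: Callable[[List[str]], str]

    def run(self, transcript: str) -> str:
        # Detect scene boundaries, then reorder the scenes.
        scenes = self.reorder(self.detect_scenes(transcript))
        # Attach a textual description of the visuals to each scene.
        for scene in scenes:
            scene.visual_text = self.describe_visuals(scene)
        # Summarize each scene, then fuse into one episode summary.
        return self.fuse([self.summarize_scene(s) for s in scenes])


# Toy stand-ins (assumptions): blank lines mark scene boundaries, the
# reordering step is the identity, and a scene summary is its first sentence.
toy = Pipeline(
    detect_scenes=lambda t: [Scene(p.strip()) for p in t.split("\n\n") if p.strip()],
    reorder=lambda scenes: scenes,
    describe_visuals=lambda scene: "",
    summarize_scene=lambda scene: scene.dialogue.split(".")[0].strip() + ".",
    fuse=lambda parts: " ".join(parts),
)

summary = toy.run("Ann finds a letter. She hides it.\n\nBob calls Ann. No answer.")
print(summary)
```

Because each stage is just a callable, any one of them can be replaced, for example with a stronger scene summarizer, without touching the others.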

Contribution

It proposes a flexible modular framework for TV show summarization and introduces PRISMA, a new metric for fact-based summary evaluation.

Findings

01 Outperforms comparison models on ROUGE scores

02 Produces higher-quality summaries according to human evaluation

03 Introduces the PRISMA metric for fact-based summary assessment

Abstract

In this paper we address the task of summarizing television shows, which touches key areas in AI research: complex reasoning, multiple modalities, and long narratives. We present a modular approach where separate components perform specialized sub-tasks which we argue affords greater flexibility compared to end-to-end methods. Our modules involve detecting scene boundaries, reordering scenes so as to minimize the number of cuts between different events, converting visual information to text, summarizing the dialogue in each scene, and fusing the scene summaries into a final summary for the entire episode. We also present a new metric, PRISMA (Precision and Recall EvaluatIon of Summary FActs), to measure both precision and recall of generated summaries, which we decompose into atomic facts. Tested on the recently released SummScreen3D dataset, our method produces higher quality summaries…
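The abstract describes PRISMA only at a high level: summaries are decomposed into atomic facts, and both precision and recall are measured. The sketch below illustrates fact-level precision and recall in general and is not the paper's procedure. It assumes the facts have already been extracted, and the `is_supported` callable, here a toy exact-match check, is a hypothetical stand-in for whatever judgment decides that a fact is supported by the other summary.

```python
from typing import Callable, List, Tuple


def fact_precision_recall(
    generated_facts: List[str],
    reference_facts: List[str],
    is_supported: Callable[[str, List[str]], bool],
) -> Tuple[float, float]:
    """Fact-level precision and recall (illustrative, not the paper's code).

    precision: share of generated facts supported by the reference facts.
    recall:    share of reference facts supported by the generated facts.
    """
    def fraction_supported(facts: List[str], evidence: List[str]) -> float:
        if not facts:
            return 0.0
        return sum(is_supported(f, evidence) for f in facts) / len(facts)

    precision = fraction_supported(generated_facts, reference_facts)
    recall = fraction_supported(reference_facts, generated_facts)
    return precision, recall


def exact_match(fact: str, evidence: List[str]) -> bool:
    # Toy support check (assumption): case-insensitive exact string match.
    return fact.lower() in {e.lower() for e in evidence}


generated = ["Ann finds a letter", "Bob leaves town"]
reference = [
    "Ann finds a letter",
    "Bob calls Ann",
    "Ann hides the letter",
    "Carl arrives",
]
precision, recall = fact_precision_recall(generated, reference, exact_match)
print(precision, recall)
```

In this toy case one of the two generated facts appears in the reference, and one of the four reference facts appears in the generated summary, so the two scores differ. Reporting them separately shows whether a summary errs by stating unsupported facts or by omitting reference facts.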

Code & Models

Repositories

lou1sm/modular_multimodal_summarization (PyTorch, official)

Videos

A Modular Approach for Multimodal Summarization of TV Shows

Taxonomy

Topics: Advanced Text Analysis Techniques · Natural Language Processing Techniques