See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization

Yash Kumar Atri; Shraman Pramanick; Vikram Goyal; Tanmoy Chakraborty

arXiv:2105.09601 · cs.LG · September 16, 2021

TL;DR

This paper introduces AVIATE, a large-scale multimodal dataset of conference-presentation videos of diverse durations, and proposes FLORAL, a Transformer-based model that uses multimodal information for abstractive text summarization and outperforms baseline models on AVIATE and How2.

Contribution

The paper presents AVIATE, the first large-scale dataset for abstractive text summarization with videos of diverse duration, and FLORAL, a new multimodal Transformer model for improved abstractive summarization.

Findings

01. FLORAL outperforms baseline models on AVIATE and How2 datasets.

02. Increased self-attention improves multimodal feature integration.

03. Significant ROUGE-L score improvements demonstrate effectiveness.
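
For readers unfamiliar with the metric named in the findings, ROUGE-L scores a generated summary by the longest common subsequence (LCS) it shares with the reference summary. The sketch below is a minimal illustration, assuming lowercased whitespace tokenization and a balanced F-measure (beta = 1); the function names `lcs_length` and `rouge_l` are hypothetical, and this is not the evaluation script used in the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Tokens match: extend the LCS found without either token.
                curr.append(prev[j - 1] + 1)
            else:
                # No match: carry over the better of the two neighbours.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-measure (assumed tokenization: lowercase + split)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


if __name__ == "__main__":
    print(rouge_l("the model summarizes long videos",
                  "the model summarizes videos"))
```

Published implementations often add stemming and a summary-level variant, so absolute scores from this sketch are not comparable with reported numbers.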

Abstract

In recent years, abstractive text summarization with multimodal inputs has started drawing attention due to its ability to accumulate information from different source modalities and generate a fluent textual summary. However, existing methods use short videos as the visual modality and short summaries as the ground truth, and therefore perform poorly on lengthy videos and long ground-truth summaries. Additionally, there exists no benchmark dataset to generalize this task to videos of varying lengths. In this paper, we introduce AVIATE, the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations at well-known academic conferences such as NDSS, ICML, and NeurIPS. We use the abstracts of the corresponding research papers as the reference summaries, which ensures adequate quality and uniformity of the ground truth. We then propose FLORAL,…

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Residual Connection · Adam · Layer Normalization · Dense Connections
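
As background for the attention entries in the list above, the following is a minimal sketch of single-head scaled dot-product attention, the building block that multi-head attention runs in parallel over several learned projections. It is written in plain Python for readability; the function names and the toy inputs are assumptions for illustration, and it does not reproduce FLORAL's guided-attention layers.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query: softmax(q.k / sqrt(d_k)) weighted sum of values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors, one dimension at a time.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    # Toy example: one query attending over two key/value pairs.
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[2.0, 0.0], [0.0, 4.0]]
    print(scaled_dot_product_attention(q, k, v))
```

In a multimodal setting, the queries could come from one modality (for example, transcript tokens) and the keys and values from another (for example, video frames); that pairing is an assumption here, not a description of the paper's architecture.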