See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization
Yash Kumar Atri, Shraman Pramanick, Vikram Goyal, Tanmoy Chakraborty

TL;DR
This paper introduces AVIATE, a large-scale multimodal video dataset with diverse durations, and proposes FLORAL, a novel Transformer-based model that effectively leverages multimodal information for abstractive video summarization, outperforming existing methods.
Contribution
The paper presents AVIATE, the first large-scale video dataset with diverse durations, and FLORAL, a new multimodal Transformer model for improved abstractive summarization.
Findings
FLORAL outperforms baseline models on AVIATE and How2 datasets.
Increased self-attention improves multimodal feature integration.
Significant ROUGE-L score improvements demonstrate effectiveness.
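For readers unfamiliar with the metric named above, ROUGE-L scores a generated summary by the longest common subsequence (LCS) of tokens it shares with the reference. The following is a minimal sketch of sentence-level ROUGE-L, not the evaluation code used in the paper. The function names `lcs_length` and `rouge_l`, the lowercase whitespace tokenization, and the default `beta=1.0` are all assumptions made for illustration. Official ROUGE toolkits typically add stemming and a summary-level variant, so their scores may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Row-by-row dynamic programming; prev[j] holds the LCS length of
    # the tokens of `a` seen so far against the first j tokens of `b`.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score (illustrative sketch only)."""
    # Assumption: lowercase whitespace tokenization, no stemming.
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    # F-measure weighted by beta; beta = 1 gives the harmonic mean.
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    generated = "we propose a multimodal model for video summarization"
    reference = "we propose a transformer model for abstractive video summarization"
    print(f"ROUGE-L: {rouge_l(generated, reference):.3f}")
```

Because LCS rewards matching tokens in the same order without requiring them to be contiguous, the score reflects sentence-level structure rather than only n-gram overlap.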
Abstract
In recent years, abstractive text summarization with multimodal inputs has started drawing attention due to its ability to accumulate information from different source modalities and generate a fluent textual summary. However, existing methods use short videos as the visual modality and short summaries as the ground truth; therefore, they perform poorly on lengthy videos and long ground-truth summaries. Additionally, there exists no benchmark dataset to generalize this task to videos of varying lengths. In this paper, we introduce AVIATE, the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations at well-known academic conferences such as NDSS, ICML, and NeurIPS. We use the abstracts of the corresponding research papers as the reference summaries, which ensures adequate quality and uniformity of the ground truth. We then propose FLORAL,…
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Residual Connection · Adam · Layer Normalization · Dense Connections
