Fusing Multimodal Signals on Hyper-complex Space for Extreme Abstractive Text Summarization (TL;DR) of Scientific Contents
Yash Kumar Atri, Vikram Goyal, Tanmoy Chakraborty

TL;DR
This paper introduces a novel multimodal dataset and a hyper-complex Transformer model for extreme abstractive summarization of scientific content, leveraging videos, audio, and text to generate concise summaries.
Contribution
The paper presents the first dataset for multimodal extreme abstractive summarization and a novel hyper-complex Transformer model that effectively captures modality interactions in a geometric space.
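As a rough illustration of what operating in a hyper-complex space can mean, the sketch below implements the Hamilton product of two quaternions, one common hyper-complex algebra. This summary does not specify which algebra or fusion rule the model uses, so treating quaternions as the relevant case is an assumption, and the function name `hamilton_product` is hypothetical.

```python
from typing import Tuple

Quaternion = Tuple[float, float, float, float]  # (real, i, j, k) components


def hamilton_product(p: Quaternion, q: Quaternion) -> Quaternion:
    """Multiply two quaternions; every output component mixes all inputs."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i part
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j part
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k part
    )


# Illustrative only: the product is non-commutative, so i*j and j*i differ.
i_unit = (0.0, 1.0, 0.0, 0.0)
j_unit = (0.0, 0.0, 1.0, 0.0)
print(hamilton_product(i_unit, j_unit))
print(hamilton_product(j_unit, i_unit))
```

Because each output component depends on all four components of both operands, a product of this kind is one way signals from different sources could interact inside a single operation.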
Findings
mTLDRgen outperforms 20 baselines on ROUGE scores
Generated summaries are fluent and source-congruent
Model effectively captures multimodal interactions
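For readers unfamiliar with the metric behind the first finding, here is a minimal sketch of a ROUGE-1 F1 score computed as unigram overlap. It is a simplification under stated assumptions: lowercasing and whitespace tokenization only, no stemming, and it is not the official ROUGE toolkit or the paper's evaluation script. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap, whitespace tokens."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Made-up sentences for illustration, not examples from the dataset.
score = rouge1_f1(
    "the model fuses video",
    "the model fuses audio and video",
)
print(round(score, 3))
```

A very short summary can reach high precision while missing reference content, which is why the F1 form balances precision against recall.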
Abstract
The realm of scientific text summarization has experienced remarkable progress due to the availability of annotated brief summaries and ample data. However, the utilization of multiple input modalities, such as videos and audio, has yet to be thoroughly explored. At present, scientific multimodal-input-based text summarization systems tend to employ longer target summaries like abstracts, leading to an underwhelming performance in the task of text summarization. In this paper, we deal with a novel task of extreme abstractive text summarization (aka TL;DR generation) by leveraging multiple input modalities. To this end, we introduce mTLDR, a first-of-its-kind dataset for the aforementioned task, comprising videos, audio, and text, along with both author-composed summaries and expert-annotated summaries. The mTLDR dataset accompanies a total of 4,182 instances collected from various…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Adam · Byte Pair Encoding
