SITransformer: Shared Information-Guided Transformer for Extreme Multimodal Summarization
Sicheng Liu, Lintao Wang, Xiaogang Zhu, Xuequan Lu, Zhiyong Wang, Kun Hu

TL;DR
SITransformer introduces a shared information-guided transformer architecture that effectively filters and integrates multimodal data to generate accurate, concise summaries, addressing the challenge of irrelevant information in extreme multimodal summarization.
Contribution
The paper proposes a novel shared information-guided transformer with a filtering process and cross-modal attention for improved extreme multimodal summarization.
Findings
Significantly improves summarization quality for video and text data.
Effectively filters out topic-irrelevant information across modalities.
Achieves superior performance on XMSMO benchmark datasets.
Abstract
Extreme Multimodal Summarization with Multimodal Output (XMSMO) has become an attractive summarization approach, integrating various types of information to create extremely concise yet informative summaries for individual modalities. Existing methods overlook the issue that multimodal data often contains more topic-irrelevant information, which can mislead the model into producing inaccurate summaries, especially extremely short ones. In this paper, we propose SITransformer, a Shared Information-guided Transformer for extreme multimodal summarization. It has a shared information-guided pipeline that involves a cross-modal shared information extractor and a cross-modal interaction module. The extractor formulates semantically shared salient information from different modalities by devising a novel filtering process consisting of a differentiable top-k selector and a…
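The abstract mentions a differentiable top-k selector as part of the filtering process but is cut off before describing it. To illustrate the general idea only, here is a minimal sketch of one common relaxation: a sigmoid gate centered between the k-th and (k+1)-th largest relevance scores. This is not the authors' selector. The function names (`soft_top_k_mask`, `filter_features`), the `temperature` parameter, and the thresholding scheme are all assumptions made for this example.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_top_k_mask(scores, k, temperature=0.1):
    """Hypothetical relaxed top-k mask (illustrative, not the paper's method).

    Places a sigmoid gate at the midpoint between the k-th and (k+1)-th
    largest scores. As temperature approaches 0 the mask approaches the
    hard 0/1 top-k indicator; for temperature > 0 it varies smoothly
    with the scores almost everywhere.
    """
    if not 0 < k < len(scores):
        raise ValueError("k must satisfy 0 < k < len(scores)")
    ranked = sorted(scores, reverse=True)
    threshold = 0.5 * (ranked[k - 1] + ranked[k])
    return [_sigmoid((s - threshold) / temperature) for s in scores]


def filter_features(features, scores, k, temperature=0.1):
    """Scale each feature vector by its soft top-k weight (assumed usage)."""
    mask = soft_top_k_mask(scores, k, temperature)
    return [[m * x for x in row] for m, row in zip(mask, features)]


# Usage: four candidate features with made-up relevance scores; keep about two.
scores = [2.0, -1.0, 0.5, 3.0]
features = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
mask = soft_top_k_mask(scores, k=2)
filtered = filter_features(features, scores, k=2)
```

In this sketch the two highest-scoring items receive weights close to 1 and the rest close to 0, so low-relevance features are suppressed rather than discarded outright. That soft suppression is what lets gradients reach the scoring function during training.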
Taxonomy
Topics: Advanced Text Analysis Techniques · Topic Modeling · Text and Document Classification Technologies
Methods: Linear Layer · Adam · Layer Normalization · Attention Is All You Need · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Absolute Position Encodings
