Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization
Yanghai Zhang, Ye Liu, Shiwei Wu, Kai Zhang, Xukai Liu, Qi Liu, Enhong Chen

TL;DR
This paper introduces EGMS, a novel multimodal summarization model that leverages entity information and dual encoders to improve text and image integration, demonstrating superior performance on public datasets.
Contribution
The paper presents a new entity-guided approach for multimodal summarization that effectively incorporates fine-grained entity knowledge using dual encoders and a gating mechanism.
Findings
EGMS outperforms existing models on public MSMO datasets.
Incorporating entity information significantly improves multimodal summarization quality.
The proposed model effectively integrates text, images, and entity data for better summaries.
Abstract
The rapid increase in multimedia data has spurred advancements in Multimodal Summarization with Multimodal Output (MSMO), which aims to produce a multimodal summary that integrates both text and relevant images. The inherent heterogeneity of content within multimodal inputs and outputs presents a significant challenge to the execution of MSMO. Traditional approaches typically adopt a holistic perspective on coarse image-text data or individual visual objects, overlooking the essential connections between objects and the entities they represent. To integrate the fine-grained entity knowledge, we propose an Entity-Guided Multimodal Summarization model (EGMS). Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently. A gating mechanism then combines visual data for enhanced textual summary generation,…
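The gating step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the gate's formula, so the scalar sigmoid gate below, the function name `gated_fusion`, and the parameters `weights` and `bias` are all assumptions made for illustration. The two input vectors stand in for visual features produced by the text-image and entity-image encoders.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(text_visual, entity_visual, weights, bias=0.0):
    """Blend two equal-length feature vectors with one scalar gate.

    Assumed (hypothetical) form, not taken from the paper:
        g     = sigmoid(w . [a; b] + bias)
        fused = g * a + (1 - g) * b
    where a is text_visual and b is entity_visual.
    """
    if len(text_visual) != len(entity_visual):
        raise ValueError("feature vectors must have the same length")
    concat = list(text_visual) + list(entity_visual)
    if len(weights) != len(concat):
        raise ValueError("weights must match the concatenated length")
    # Scalar gate computed from both feature vectors.
    g = sigmoid(sum(w * x for w, x in zip(weights, concat)) + bias)
    # Convex combination: g close to 1 favors the text-image features.
    fused = [g * a + (1.0 - g) * b for a, b in zip(text_visual, entity_visual)]
    return g, fused


if __name__ == "__main__":
    gate, fused = gated_fusion([1.0, 3.0], [3.0, 1.0], [0.0, 0.0, 0.0, 0.0])
    print(gate, fused)
```

With all-zero weights the gate is exactly 0.5, so the output is the element-wise average of the two inputs. A trained model would learn the gate parameters, and it may well use a vector-valued gate rather than a scalar one.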
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Multi-Head Attention · Adam · Byte Pair Encoding · Softmax · Dense Connections · Dropout
