Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization
Chenhao Cui, Xinnian Liang, Shuangzhi Wu, Zhoujun Li

TL;DR
This paper introduces ViL-Sum, a joint multi-modal encoder that models paragraph-level vision-language semantic alignment for improved multi-modal summarization, outperforming existing cascaded methods.
Contribution
It proposes a novel joint multi-modal encoder with image reordering and selection tasks to better capture semantic alignments in multi-modal summarization.
Findings
ViL-Sum significantly outperforms state-of-the-art methods.
The reordering and selection tasks effectively guide semantic alignment.
The joint encoder captures interactions between images and paragraphs.
Abstract
Most current multi-modal summarization methods follow a cascaded approach: an off-the-shelf object detector first extracts visual features, which are then fused with language representations to generate the summary with an encoder-decoder model. This cascaded approach cannot capture the semantic alignments between images and paragraphs, which are crucial to a precise summary. In this paper, we propose ViL-Sum to jointly model paragraph-level Vision-Language Semantic Alignment and Multi-Modal Summarization. The core of ViL-Sum is a joint multi-modal encoder with two well-designed tasks, image reordering and image selection. The joint multi-modal encoder captures the interactions between modalities, where the reordering task guides the model to learn paragraph-level semantic alignment and the selection task guides the model to selected…
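The two auxiliary tasks named in the abstract can be illustrated with a minimal sketch of how their training targets might be built. This is an assumption-laden illustration, not the authors' implementation: the function names, the seeded shuffle, and the idea that a set of summary-related image IDs is already available are all hypothetical, and the abstract does not specify how either set of labels is obtained.

```python
import random


def make_reordering_example(image_ids, seed=0):
    """Shuffle the images and return (shuffled_ids, target_positions).

    target_positions[i] is the original index of shuffled_ids[i], so a
    model trained on this target has to recover where each image sat in
    the document. (Hypothetical setup; not taken from the paper.)
    """
    rng = random.Random(seed)
    order = list(range(len(image_ids)))
    rng.shuffle(order)
    shuffled_ids = [image_ids[i] for i in order]
    return shuffled_ids, order


def make_selection_labels(image_ids, summary_related):
    """Return one binary label per image: 1 if summary-related, else 0.

    `summary_related` is an assumed, externally supplied set of image IDs.
    """
    return [1 if image_id in summary_related else 0 for image_id in image_ids]


if __name__ == "__main__":
    images = ["img_a", "img_b", "img_c", "img_d"]
    shuffled, targets = make_reordering_example(images, seed=1)
    print(shuffled, targets)
    print(make_selection_labels(images, {"img_b", "img_d"}))
```

In this sketch the reordering target is a permutation of positions and the selection target is a binary vector, so both could be trained as per-image classification heads on top of a shared encoder. That pairing is a design assumption made here for illustration.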
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
