Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization

Chenhao Cui; Xinnian Liang; Shuangzhi Wu; Zhoujun Li

arXiv:2208.11303·cs.CL·May 11, 2023·1 cites

Open Access

TL;DR

This paper introduces ViL-Sum, a joint multi-modal encoder that models paragraph-level vision-language semantic alignment for improved multi-modal summarization, outperforming existing cascaded methods.

Contribution

The paper proposes a joint multi-modal encoder trained with two auxiliary tasks, image reordering and image selection, to better capture semantic alignment between images and paragraphs in multi-modal summarization.

Findings

1. ViL-Sum significantly outperforms state-of-the-art methods.
2. The reordering and selection tasks effectively guide semantic alignment.
3. The joint encoder captures interactions between images and paragraphs.

Abstract

Most current multi-modal summarization methods follow a cascaded approach: an off-the-shelf object detector first extracts visual features, and these features are then fused with language representations to generate the summary with an encoder-decoder model. This cascaded approach cannot capture the semantic alignments between images and paragraphs, which are crucial to a precise summary. In this paper, we propose ViL-Sum to jointly model paragraph-level Vision-Language Semantic Alignment and Multi-Modal Summarization. The core of ViL-Sum is a joint multi-modal encoder with two well-designed tasks, image reordering and image selection. The joint multi-modal encoder captures the interactions between modalities, where the reordering task guides the model to learn paragraph-level semantic alignment and the selection task guides the model to selected…
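
The abstract as shown here does not spell out how the image reordering task is constructed, so the following is only a minimal sketch of one common way to set up such a task: shuffle the images of a document and keep each image's original position as the prediction target. The function names `make_reordering_example` and `restore_order`, the `seed` parameter, and the use of file names as stand-ins for images are all assumptions made for illustration; they are not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple


def make_reordering_example(
    images: Sequence[str], seed: int = 0
) -> Tuple[List[str], List[int]]:
    """Build one hypothetical training example for an image reordering task.

    Returns the shuffled images and, for each shuffled slot, the index the
    image held in the original document (the label a model would predict).
    """
    rng = random.Random(seed)          # local RNG so the shuffle is reproducible
    order = list(range(len(images)))   # original positions 0..n-1
    rng.shuffle(order)                 # order[k] = original index of slot k
    shuffled = [images[i] for i in order]
    return shuffled, order


def restore_order(shuffled: Sequence[str], positions: Sequence[int]) -> List[str]:
    """Place each shuffled image back at its (predicted) original position."""
    restored: List[str] = [""] * len(shuffled)
    for image, position in zip(shuffled, positions):
        restored[position] = image
    return restored


if __name__ == "__main__":
    doc_images = ["img_a.jpg", "img_b.jpg", "img_c.jpg", "img_d.jpg"]
    shuffled, labels = make_reordering_example(doc_images, seed=1)
    # With perfect predictions, the original order is recovered.
    print(restore_order(shuffled, labels) == doc_images)
```

In this sketch the labels are a permutation of the original positions, so a model that predicts them correctly has implicitly matched each image to its place in the document. How ViL-Sum itself encodes the images and formulates the loss is not stated in this excerpt.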

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques