Hierarchical Cross-Modality Semantic Correlation Learning Model for Multimodal Summarization
Litian Zhang, Xiaoming Zhang, Junshu Pan, Feiran Huang

TL;DR
This paper introduces a hierarchical cross-modality semantic correlation learning model (HCSCL) for multimodal summarization, effectively capturing hierarchical and intra/inter-modal correlations to improve summary quality.
Contribution
The paper proposes a novel HCSCL model that encodes intra-modal and hierarchical cross-modal correlations using graph networks and a hierarchical fusion framework.
Findings
HCSCL outperforms baseline methods on automatic evaluation metrics.
The model achieves higher diversity in generated summaries.
Extensive experiments validate the effectiveness of the approach.
Abstract
Multimodal summarization with multimodal output (MSMO) generates a summary with both textual and visual content. A multimodal news report contains heterogeneous content, which makes MSMO nontrivial. Moreover, different modalities of data in a news report are observed to correlate hierarchically. Traditional MSMO methods handle different modalities of data indistinguishably by learning a representation for the whole data, which does not adapt directly to heterogeneous content and hierarchical correlation. In this paper, we propose a hierarchical cross-modality semantic correlation learning model (HCSCL) to learn the intra- and inter-modal correlations in multimodal data. HCSCL adopts a graph network to encode the intra-modal correlation. Then, a hierarchical fusion framework is proposed to learn the hierarchical correlation between text and images. Furthermore, we…
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Text and Document Classification Technologies
