Self-Supervised Multimodal Opinion Summarization
Jinbae Im, Moonki Kim, Hoyeop Lee, Hyunsouk Cho, Sehee Chung

TL;DR
This paper introduces MultimodalSum, a self-supervised framework that leverages both text and non-text review data, including images and metadata, to generate more comprehensive opinion summaries.
Contribution
It proposes a novel multimodal training pipeline with separate encoders for each modality and end-to-end fusion, enhancing opinion summarization with non-text data.
Findings
MultimodalSum outperforms text-only models on Yelp and Amazon datasets.
Pretraining on individual modalities improves overall summarization quality.
Incorporating non-text data significantly enhances summary informativeness.
Abstract
Recently, opinion summarization, which is the generation of a summary from multiple reviews, has been conducted in a self-supervised manner by considering a sampled review as a pseudo summary. However, non-text data such as image and metadata related to reviews have been considered less often. To use the abundant information contained in non-text data, we propose a self-supervised multimodal opinion summarization framework called MultimodalSum. Our framework obtains a representation of each modality using a separate encoder for each modality, and the text decoder generates a summary. To resolve the inherent heterogeneity of multimodal data, we propose a multimodal training pipeline. We first pretrain the text encoder--decoder based solely on text modality data. Subsequently, we pretrain the non-text modality encoders by considering the pretrained text decoder as a pivot for the…
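The self-supervised setup mentioned in the abstract, where a sampled review serves as a pseudo summary, can be illustrated with a small sketch. This is not the authors' code: the function name `make_pseudo_pair`, the uniform random sampling, and the example reviews are all assumptions made for illustration, and the paper may select the pseudo summary differently.

```python
import random
from typing import List, Tuple


def make_pseudo_pair(reviews: List[str], rng: random.Random) -> Tuple[List[str], str]:
    """Hold out one review as the pseudo summary; the rest become the input.

    Hypothetical helper: uniform sampling is an assumption, not the paper's rule.
    """
    if len(reviews) < 2:
        raise ValueError("need at least two reviews to form a training pair")
    idx = rng.randrange(len(reviews))           # index of the held-out review
    target = reviews[idx]                       # pseudo summary (training target)
    sources = reviews[:idx] + reviews[idx + 1:]  # remaining reviews (model input)
    return sources, target


# Made-up reviews for a single business, used only to exercise the helper.
reviews = [
    "Great tacos and friendly staff.",
    "The patio is lovely but service was slow.",
    "Portions are generous and prices are fair.",
    "Parking is difficult on weekends.",
]
sources, target = make_pseudo_pair(reviews, random.Random(0))
```

Each call yields one (input reviews, target review) pair, so no human-written summaries are needed to train the text encoder-decoder.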
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
