MMCIG: Multimodal Cover Image Generation for Text-only Documents and Its Dataset Construction via Pseudo-labeling

Hyeyeon Kim; Sungwoo Han; Jingun Kwon; Hidetaka Kamigaito; Manabu Okumura

arXiv:2508.17199·cs.CV·August 26, 2025

TL;DR

This paper introduces MMCIG, a new task for generating a concise summary and a matching image from text-only documents, along with a pseudo-labeling method to create datasets for training.

Contribution

The study proposes a multimodal pseudo-labeling approach to construct high-quality datasets for cover image generation from text documents, addressing the lack of existing datasets.

Findings

01 The pseudo-labeling method produces more accurate datasets.

02 Generated images are of higher quality compared to other methods.

03 The approach effectively links summaries with appropriate images.

Abstract

In this study, we introduce a novel cover image generation task that produces both a concise summary and a visually corresponding image from a given text-only document. Because no existing datasets are available for this task, we propose a multimodal pseudo-labeling method to construct high-quality datasets at low cost. We first collect documents that contain multiple images with captions, together with their summaries, and exclude factually inconsistent instances. Our approach selects one image from the multiple images accompanying each document. Using the gold summary, we independently rank both the images and their captions. We then assign a pseudo-label to an image only when both the image and its corresponding caption are ranked first in their respective rankings. Finally, we remove documents that contain direct image references within their text. Experimental results demonstrate that the…
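The selection rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each image and each caption already has a relevance score against the gold summary (the abstract does not say how these scores are computed), and the names `pseudo_label`, `has_image_reference`, and the `IMAGE_REF` pattern are hypothetical.

```python
import re
from typing import Optional, Sequence

# Hypothetical pattern for "direct image references"; the abstract does not
# specify how such references are detected.
IMAGE_REF = re.compile(r"\b(?:pictured|fig(?:ure)?\.?\s*\d+)\b", re.IGNORECASE)


def top_index(scores: Sequence[float]) -> int:
    """Return the index of the highest score (first index on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def pseudo_label(
    image_scores: Sequence[float],
    caption_scores: Sequence[float],
) -> Optional[int]:
    """Pick one image for a document, or None if the rankings disagree.

    image_scores[i] and caption_scores[i] are assumed to be similarity
    scores between the gold summary and the i-th image / its caption.
    """
    if not image_scores or len(image_scores) != len(caption_scores):
        return None
    best_image = top_index(image_scores)
    best_caption = top_index(caption_scores)
    # Label only when the same item is ranked first in both rankings.
    return best_image if best_image == best_caption else None


def has_image_reference(text: str) -> bool:
    """Flag documents whose text refers directly to an image."""
    return IMAGE_REF.search(text) is not None
```

Under these assumptions, a document would be kept for the dataset only if `pseudo_label` returns an index and `has_image_reference` is false for its text. Requiring agreement between the two rankings trades dataset size for label precision, which matches the abstract's emphasis on constructing a high-quality dataset.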
