WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset

Andrea Burns; Krishna Srinivasan; Joshua Ainslie; Geoff Brown; Bryan A. Plummer; Kate Saenko; Jianmo Ni; Mandy Guo

arXiv:2305.05432·cs.CL·May 10, 2023·1 cites

TL;DR

WikiWeb2M is a multimodal Wikipedia dataset that keeps each page's full set of images, text, and structure, supporting research on webpage understanding tasks.

Contribution

It introduces the first dataset to retain complete webpage images, text, and structure, facilitating new multimodal webpage understanding tasks.

Findings

01

Enables page description generation and summarization.

02

Supports contextual image captioning tasks.

03

Provides a large-scale, structured webpage dataset.

Abstract

Webpages have been a rich resource for language and vision-language tasks. Yet only pieces of webpages are kept: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention and structured image-text data has been underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage 2M (WikiWeb2M) suite, the first to retain the full set of images, text, and structure data available in a page. WikiWeb2M can be used for tasks like page description generation, section summarization, and contextual image captioning.
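As a rough illustration of how one page-level record could feed the three tasks named in the abstract, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Page`, `Section`, and `Image` classes, their field names, and the way each task's input and target are split are hypothetical and are not the dataset's actual schema or the authors' task definitions.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


# Hypothetical record types; the real dataset schema may differ.
@dataclass
class Image:
    url: str
    caption: Optional[str] = None


@dataclass
class Section:
    title: str
    text: str
    images: List[Image] = field(default_factory=list)


@dataclass
class Page:
    title: str
    description: str  # assumed: a page-level description stored with the page
    sections: List[Section] = field(default_factory=list)


def page_description_sample(page: Page) -> Dict[str, str]:
    """Assumed setup: predict the page description from the section contents."""
    context = "\n".join(f"{s.title}: {s.text}" for s in page.sections)
    return {"input": context, "target": page.description}


def section_summarization_samples(page: Page) -> Iterator[Dict[str, str]]:
    """Assumed setup: a section's first sentence is the target, the rest is input."""
    for s in page.sections:
        first, sep, rest = s.text.partition(". ")  # naive sentence split
        if sep and rest:
            yield {"input": rest, "target": first + "."}


def contextual_captioning_samples(page: Page) -> Iterator[Dict[str, str]]:
    """Assumed setup: predict an image caption given the image and its section text."""
    for s in page.sections:
        for img in s.images:
            if img.caption:
                yield {"image": img.url, "context": s.text, "target": img.caption}


# Toy page, invented purely for this example.
page = Page(
    title="Example",
    description="A toy page.",
    sections=[
        Section(
            title="History",
            text="It began early. It grew later.",
            images=[Image(url="img/a.png", caption="An early photo")],
        )
    ],
)

desc = page_description_sample(page)
summaries = list(section_summarization_samples(page))
captions = list(contextual_captioning_samples(page))
```

The point of the sketch is only that keeping images, text, and structure together in one record lets all three kinds of samples be derived from the same page, which is not possible when only image-caption pairs or only article text are stored.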

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling