WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset
Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo

TL;DR
WikiWeb2M is a multimodal Wikipedia dataset that keeps each page's full set of images, text, and structure, supporting research on multimodal webpage understanding.
Contribution
It introduces the first dataset to retain complete webpage images, text, and structure, facilitating new multimodal webpage understanding tasks.
Findings
Enables page description generation and summarization.
Supports contextual image captioning tasks.
Provides a large-scale, structured webpage dataset.
Abstract
Webpages have been a rich resource for language and vision-language tasks. Yet only pieces of webpages are kept: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention and structured image-text data has been underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage 2M (WikiWeb2M) suite, the first to retain the full set of images, text, and structure data available in a page. WikiWeb2M can be used for tasks like page description generation, section summarization, and contextual image captioning.
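To make the three tasks concrete, here is a minimal sketch of how a single page record that keeps images, text, and structure together could yield an (input, target) pair for each task. The class names, field names, and helper functions are all hypothetical and do not reflect the released WikiWeb2M schema. Using a section's first sentence as its summary target is likewise an illustrative assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


# Hypothetical record types; NOT the actual WikiWeb2M schema.
@dataclass
class Image:
    url: str
    caption: Optional[str] = None  # target for contextual image captioning


@dataclass
class Section:
    title: str
    text: str
    images: List[Image] = field(default_factory=list)


@dataclass
class Page:
    url: str
    title: str
    description: str  # assumed page-level description field
    sections: List[Section] = field(default_factory=list)


def page_description_example(page: Page) -> Tuple[str, str]:
    """Input: all section text in page order. Target: the page description."""
    context = "\n".join(f"{s.title}: {s.text}" for s in page.sections)
    return context, page.description


def section_summarization_example(page: Page, index: int) -> Tuple[str, str]:
    """Input: the section minus its first sentence. Target: that first sentence.

    Splitting on ". " is a crude sentence boundary used only for illustration.
    """
    first, sep, rest = page.sections[index].text.partition(". ")
    summary = first.strip() + ("." if sep else "")
    return rest.strip(), summary


def captioning_examples(page: Page) -> List[Tuple[Tuple[str, str, str], str]]:
    """Input: (image URL, section title, section text). Target: the caption."""
    examples = []
    for section in page.sections:
        for image in section.images:
            if image.caption is not None:
                examples.append(
                    ((image.url, section.title, section.text), image.caption)
                )
    return examples
```

Because all three helpers read from the same `Page` object, the sketch shows why keeping a page's images, text, and structure in one place matters: each task only selects a different slice of the same record as input and target.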
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
