WikiTableT: A Large-Scale Data-to-Text Dataset for Generating Wikipedia Article Sections
Mingda Chen, Sam Wiseman, Kevin Gimpel

TL;DR
WikiTableT is a large-scale dataset for data-to-text generation of Wikipedia sections. It supports research on long-form, multi-domain text generation with diverse metadata, a setting in which coherence and factuality remain challenging.
Contribution
We introduce WikiTableT, a large-scale dataset for Wikipedia section generation, and benchmark several training and decoding strategies on it, highlighting challenges in coherence and factuality.
Findings
Best models generate fluent, high-quality texts
Models struggle with coherence and factuality
Dataset covers diverse topics and generation tasks
Abstract
Datasets for data-to-text generation typically focus either on multi-domain, single-sentence generation or on single-domain, long-form generation. In this work, we cast generating Wikipedia sections as a data-to-text generation task and create a large-scale dataset, WikiTableT, that pairs Wikipedia sections with their corresponding tabular data and various metadata. WikiTableT contains millions of instances, covering a broad range of topics, as well as a variety of flavors of generation tasks with different levels of flexibility. We benchmark several training and decoding strategies on WikiTableT. Our qualitative analysis shows that the best approaches can generate fluent and high-quality texts, but they struggle with coherence and factuality, showing the potential for our dataset to inspire future work on long-form generation.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Cancer-related gene regulation
