WebSRC: A Dataset for Web-Based Structural Reading Comprehension

Xingyu Chen; Zihan Zhao; Lu Chen; Danyang Zhang; Jiabao Ji; Ao Luo; Yuxuan Xiong; Kai Yu

arXiv:2101.09465·cs.CL·November 9, 2021

TL;DR

WebSRC introduces a large dataset for web-based structural reading comprehension, challenging models to understand both textual and structural web page information to answer questions accurately.

Contribution

The paper presents WebSRC, a dataset of 400K QA pairs collected from 6.4K web pages. Each page comes with its HTML source code, a screenshot, and metadata, so that models can be evaluated on structural comprehension.

Findings

1. Structural information improves comprehension accuracy.
2. Baseline models find the task challenging, indicating room for improvement.
3. Visual features contribute to better answer prediction.

Abstract

Web search is an essential way for humans to obtain information, but it is still a great challenge for machines to understand the contents of web pages. In this paper, we introduce the task of structural reading comprehension (SRC) on the web. Given a web page and a question about it, the task is to find the answer from the web page. This task requires a system to understand not only the semantics of the text but also the structure of the web page. Moreover, we propose WebSRC, a novel Web-based Structural Reading Comprehension dataset. WebSRC consists of 400K question-answer pairs, which are collected from 6.4K web pages. Along with the QA pairs, the corresponding HTML source code, screenshots, and metadata are also provided in our dataset. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no. We…
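To make the task setup concrete, here is a minimal Python sketch of what one SRC example might look like and why page structure matters. It is not the WebSRC file format or the authors' baseline: the `SRCExample` container, its field names, the helper functions, and the sample page are all hypothetical, and only the standard-library `html.parser` module is used.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Tags that never take a closing tag, so they must not be pushed on the path.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


@dataclass
class SRCExample:
    """Hypothetical container for one example; field names are illustrative."""
    html: str
    question: str
    answer: str  # a text span from the page, or "yes" / "no"


class TextWithPath(HTMLParser):
    """Collect each text node together with the tag path leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []  # list of (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    """Return all non-empty text nodes of the page with their tag paths."""
    parser = TextWithPath()
    parser.feed(html)
    parser.close()
    return parser.nodes


def is_span_answer(example):
    """True if the answer appears verbatim inside some text node."""
    return any(example.answer in text for _, text in text_nodes(example.html))


# Illustrative page: the answer is only identifiable through the th/td pairing.
page = (
    "<table>"
    "<tr><th>Price</th><td>$20,000</td></tr>"
    "<tr><th>Engine</th><td>2.0L</td></tr>"
    "</table>"
)
example = SRCExample(html=page, question="What is the price?", answer="$20,000")

for path, text in text_nodes(example.html):
    print(path, "->", text)
print("span answer:", is_span_answer(example))

# Rough scale implied by the abstract's rounded figures.
qa_pairs_per_page = 400_000 / 6_400
print("approx. QA pairs per page:", qa_pairs_per_page)
```

In this toy page, the flat text "Price $20,000 Engine 2.0L" gives no explicit signal about which value belongs to which label, whereas the tag paths keep the header and value cells of each row distinguishable. The last line simply divides the abstract's rounded counts (400K pairs, 6.4K pages), so the resulting average of about 62.5 questions per page is approximate.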

Code & Models

Repositories

X-LANCE/WebSRC-Baseline (PyTorch)

Taxonomy

Topics: Web Data Mining and Analysis · Topic Modeling · Misinformation and Its Impacts