LiveWeb-IE: A Benchmark For Online Web Information Extraction

Seungbin Yang; Jihwan Kim; Jaemin Choi; Dongjin Kim; Soyoung Yang; ChaeHun Park; Jaegul Choo

arXiv:2603.13773·cs.CL·March 17, 2026


Open Access · 3 Reviews

TL;DR

This paper introduces LiveWeb-IE, a live website benchmark for evaluating web information extraction systems in real-world, dynamic scenarios, and proposes a novel visual grounding framework to improve extraction robustness.

Contribution

The paper presents a new live-website benchmark for evaluating web information extraction (WIE) and a multi-stage visual grounding framework that mimics human cognition to improve extraction accuracy.

Findings

1. The proposed Visual Grounding Scraper (VGS) outperforms baseline models in robustness and accuracy.
2. The benchmark enables granular assessment across four levels of query complexity.
3. LiveWeb-IE bridges the gap between offline benchmarks and real-world web scenarios.

Abstract

Web information extraction (WIE) is the task of automatically extracting data from web pages, offering high utility for various applications. The evaluation of WIE systems has traditionally relied on benchmarks built from HTML snapshots captured at a single point in time. However, this offline evaluation paradigm fails to account for the temporally evolving nature of the web; consequently, performance on these static benchmarks often fails to generalize to dynamic real-world scenarios. To bridge this gap, we introduce LiveWeb-IE, a new benchmark designed for evaluating WIE systems directly against live websites. Based on trusted and permission-granted websites, we curate natural language queries that require information extraction of various data categories, such as text, images, and hyperlinks. We further design these queries to represent four levels of complexity, based on the number…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The methodology is described well and the paper is easy to read.
2. The experimental results are pretty comprehensive, with a variety of backbone LLMs used.
3. The LiveWeb-IE benchmark can be a very valuable resource to the research community.

Weaknesses

1. The novelty of the VGS approach is limited. The method mainly incorporates VLMs for prompting to narrow down relevant web elements, with the XPath generation part already being done in prior work [1].
2. The related work section is pretty lacking, with no discussion of distinctions/comparisons of VGS with prior WebIE methodologies.
3. While VGS is relatively performant, the authors should also show a cost comparison with prior baselines. The approach of iteratively pass all regions of the we

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Empirical results show that the newly proposed benchmark is more challenging and that the performance of state-of-the-art LLMs/LMMs is less saturated, showing a gap between WIE systems and humans on more up-to-date live websites.
- The proposed VGS framework is effective on both closed-source and open-source models.
- The paper is clearly written and contains sufficient ablations on the VGS components.

Weaknesses

- While the LiveWeb-IE benchmark is claimed to be “evaluating directly against live websites”, it is not clear how the benchmark automatically evolves as the website updates over time. The dataset construction pipeline is still based on a snapshot of a certain time and requires human verification to curate the data. It is potentially an overclaim that the benchmark is “Live”.
- It is also not clear how to handle layout changes through time while still keeping the evaluation/annotation valid; and h

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The proposed Visual Grounding Scraper (VGS) framework that mimics human information-seeking behavior on web pages is novel and practical. Multi-stage visual grounding (region → element → XPath) effectively reduces HTML noise, achieving great performance on both LiveWeb-IE and other offline benchmarks.

Weaknesses

1. Weak Motivation. While the paper argues that performance on offline benchmarks fails to generalize to live websites due to temporal changes in web structures, this claim lacks sufficient empirical evidence. For instance, there is no direct comparison showing how existing methods degrade over time on the same websites, nor quantitative data on the frequency or impact of such changes. This undermines the core motivation, as it's unclear whether the offline-to-online gap is as significant as ass


Taxonomy

Topics: Web Data Mining and Analysis · Advanced Text Analysis Techniques · Advanced Image and Video Retrieval Techniques