NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction
Soyeon Kim, Namhee Kim, Yeonwoo Jeong

TL;DR
This paper introduces a comprehensive evaluation framework for web data record extraction, enabling fair comparison of traditional algorithms and LLM-based methods across diverse datasets with improved metrics and input formats.
Contribution
It presents a new evaluation framework with dataset generation, annotation, and structure-aware metrics, along with preprocessing strategies and a synthetic dataset for benchmarking extraction methods.
Findings
An LLM given Flat JSON input achieves an F1 score of 0.9567
Flat JSON input reduces hallucination in LLM extractions
Benchmarking under the new framework shows that LLMs outperform traditional algorithms
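The summary does not spell out what the Flat JSON input looks like. As a rough illustration only, the sketch below assumes it is a single-level mapping from each element's positional XPath to its text, built from a small well-formed snippet. The function name `flatten_to_xpath_map` and the sample markup are hypothetical and are not taken from the paper.

```python
import json
import xml.etree.ElementTree as ET
from collections import Counter


def flatten_to_xpath_map(element, path, out):
    """Store each element's own non-empty leading text under its positional XPath.

    Assumption: "flat" means one key per text-bearing element, no nesting.
    Tail text (text that follows a child element) is ignored in this sketch.
    """
    text = (element.text or "").strip()
    if text:
        out[path] = text
    seen = Counter()  # per-tag sibling counter, used for XPath indices
    for child in element:
        seen[child.tag] += 1
        child_path = f"{path}/{child.tag}[{seen[child.tag]}]"
        flatten_to_xpath_map(child, child_path, out)
    return out


# Hypothetical, well-formed sample; real pages would need an HTML parser.
snippet = (
    "<ul>"
    "<li><span>Laptop</span><b>$999</b></li>"
    "<li><span>Phone</span><b>$599</b></li>"
    "</ul>"
)
root = ET.fromstring(snippet)
flat = flatten_to_xpath_map(root, f"/{root.tag}[1]", {})
print(json.dumps(flat, indent=2))
```

Under this assumption, a model that is asked to answer with keys from the map can only point at positions that exist in the input, which is one plausible reading of why such a format would limit hallucinated text.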
Abstract
Effective evaluation of web data record extraction methods is crucial, yet hampered by static, domain-specific benchmarks and opaque scoring practices. This makes fair comparison between traditional algorithmic techniques, which rely on structural heuristics, and Large Language Model (LLM)-based approaches, offering zero-shot extraction across diverse layouts, particularly challenging. To overcome these limitations, we introduce a concrete evaluation framework. Our framework systematically generates evaluation datasets from arbitrary MHTML snapshots, annotates XPath-based supervision labels, and employs structure-aware metrics for consistent scoring, specifically preventing text hallucination and allowing only for the assessment of positional hallucination. It also incorporates preprocessing strategies to optimize input for LLMs while preserving DOM semantics: HTML slimming,…
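The abstract does not give the metric definitions. As a minimal sketch, assuming both predictions and labels are expressed as sets of XPaths, precision, recall, and F1 can be computed from set overlap. The function name `xpath_f1` and the sample XPaths are hypothetical, and the paper's actual structure-aware metrics may differ.

```python
def xpath_f1(predicted, gold):
    """Precision, recall, and F1 over two collections of XPath strings.

    Assumption: a predicted XPath counts as correct only if it also appears
    in the gold labels, so a wrong position is scored as a false positive.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels and predictions for one page.
gold = {
    "/ul[1]/li[1]/span[1]",
    "/ul[1]/li[1]/b[1]",
    "/ul[1]/li[2]/span[1]",
    "/ul[1]/li[2]/b[1]",
}
predicted = {
    "/ul[1]/li[1]/span[1]",
    "/ul[1]/li[1]/b[1]",
    "/ul[1]/li[2]/span[1]",
    "/ul[1]/li[3]/b[1]",  # points at a position that is not a labeled element
}
precision, recall, f1 = xpath_f1(predicted, gold)
print(precision, recall, f1)
```

Because the comparison is over positions rather than extracted strings, this kind of scoring cannot reward or penalize invented text; it only registers whether the chosen locations match the labels.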
