AXE: Low-Cost Cross-Domain Web Structured Information Extraction

Abdelrahman Mansour; Khaled W. Alshaer; Moataz Elsaban

arXiv:2602.01838 · cs.CL · April 1, 2026

TL;DR

AXE is a low-cost, high-performance web data extraction pipeline that uses tree pruning and grounding techniques to enable small models to achieve state-of-the-art results.

Contribution

The paper introduces AXE, a novel pruning-based extraction pipeline with grounding, enabling small models to outperform larger ones in web structured data extraction.

Findings

01 AXE achieves an F1 score of 88.1% on the SWDE dataset.

02 In zero-shot extraction, AXE outperforms several much larger, fully trained models.

03 Code and adaptors are publicly available on GitHub (abdo-Mansour/axetract).

Abstract

Extracting structured data from the web is often a trade-off between the brittle nature of manual heuristics and the prohibitive cost of Large Language Models. We introduce AXE (Adaptive X-Path Extractor), a pipeline that rethinks this process by treating the HTML DOM as a tree that needs pruning rather than just a wall of text to be read. AXE uses a specialized "pruning" mechanism to strip away boilerplate and irrelevant nodes, leaving behind a distilled, high-density context that allows a tiny 0.6B LLM to generate precise, structured outputs. To keep the model honest, we implement Grounded XPath Resolution (GXR), ensuring every extraction is physically traceable to a source node. Despite its low footprint, AXE achieves state-of-the-art zero-shot performance, outperforming several much larger, fully-trained alternatives with an F1 score of 88.1% on the SWDE dataset. By releasing our…
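The two ideas in the abstract, pruning the DOM tree and grounding each extracted value in a source node, can be illustrated with a minimal sketch. This is not the authors' implementation: the static tag blacklist stands in for AXE's specialized pruning mechanism, the `candidates` dictionary stands in for the output of the small LLM, and the names `prune`, `indexed_paths`, `ground`, and `BOILERPLATE_TAGS` are hypothetical. The sample page is also assumed to be well-formed markup so that the standard-library XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Assumption: a fixed blacklist replaces AXE's "specialized" pruning mechanism.
BOILERPLATE_TAGS = {"script", "style", "nav", "footer", "header", "aside"}


def prune(node):
    """Remove boilerplate subtrees and empty leaf nodes in place, bottom-up."""
    for child in list(node):
        if child.tag in BOILERPLATE_TAGS:
            node.remove(child)
            continue
        prune(child)
        has_text = bool((child.text or "").strip())
        if not has_text and len(child) == 0:
            node.remove(child)


def indexed_paths(root):
    """Yield (xpath, element) pairs with positional indices for every node."""
    def walk(node, path):
        yield path, node
        counts = {}
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            yield from walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
    yield from walk(root, f"/{root.tag}")


def ground(root, value):
    """Return the XPath of the first node whose own text contains value, else None."""
    for path, node in indexed_paths(root):
        if value in (node.text or ""):
            return path
    return None


PAGE = """<html><body>
<nav><a>Home</a></nav>
<div><h1>Acme Phone X</h1><span>Price:</span><span>$199</span><p></p></div>
<script>track()</script>
<footer>Copyright</footer>
</body></html>"""

root = ET.fromstring(PAGE)
prune(root)
pruned_html = ET.tostring(root, encoding="unicode")

# Hypothetical model output; "brand" is a hallucinated value with no source node.
candidates = {"title": "Acme Phone X", "price": "$199", "brand": "Globex"}
grounded = {field: ground(root, value) for field, value in candidates.items()}

for field, xpath in grounded.items():
    print(field, "->", xpath if xpath else "rejected (no source node)")
```

In this sketch, a value that cannot be traced back to any node in the pruned tree is rejected rather than returned, which is the intuition behind keeping every extraction traceable to its source.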

Code & Models

Repositories

abdo-Mansour/axetract (GitHub)