An Index-based Approach for Efficient and Effective Web Content Extraction
Yihan Chen, Benfeng Xu, Xiaorui Wang, Zhendong Mao

TL;DR
This paper presents an index-based web content extraction method that improves efficiency and accuracy by predicting content indices instead of generating text, enabling rapid, query-relevant extraction for large web pages.
Contribution
The paper introduces an index-based approach that partitions HTML into structure-aware, addressable segments and predicts the indices of the relevant segments instead of generating their text token by token, improving both extraction speed and accuracy over existing methods.
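The idea of partitioning a page into addressable segments and returning only indices can be illustrated with a minimal sketch. This is not the authors' implementation: the segmentation rule (splitting at a fixed set of block-level tags), the helper names `partition_html`, `predict_indices`, and `extract_by_index`, and above all the lexical-overlap scorer are assumptions made for illustration. In the paper the indices are predicted by a model; the toy scorer below only stands in for that step so the example runs with the standard library alone.

```python
from html.parser import HTMLParser

# Assumption: these block-level tags mark segment boundaries in this sketch.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6",
              "td", "th", "section", "article", "pre", "blockquote"}
# Text inside these tags is never treated as page content.
SKIP_TAGS = {"script", "style"}


class SegmentCollector(HTMLParser):
    """Splits the text of an HTML page into segments at block-level tags."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Collapse whitespace and store the buffered text as one segment.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.segments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def partition_html(html):
    """Return the page as a list of text segments; list position is the index."""
    collector = SegmentCollector()
    collector.feed(html)
    collector.close()
    return collector.segments


def predict_indices(segments, query, min_overlap=1):
    """Toy stand-in for the index predictor: keep segments sharing query terms."""
    query_terms = set(query.lower().split())
    selected = []
    for index, segment in enumerate(segments):
        overlap = len(query_terms & set(segment.lower().split()))
        if overlap >= min_overlap:
            selected.append(index)
    return selected


def extract_by_index(segments, indices):
    """Rebuild the extracted content by copying the selected segments."""
    return "\n".join(segments[i] for i in indices)


page = """
<html><body>
<div>Home | About | Login</div>
<h1>Index-based extraction</h1>
<p>The extractor returns segment indices instead of generated text.</p>
<script>trackVisit();</script>
<p>Subscribe to our newsletter.</p>
</body></html>
"""

segments = partition_html(page)
indices = predict_indices(segments, "segment indices")
content = extract_by_index(segments, indices)
print(indices)
print(content)
```

The point of the design is that the predictor's output is a short list of integers whose length does not grow with the amount of text extracted, and the final content is copied verbatim from the page rather than regenerated.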
Findings
Outperforms existing methods in accuracy and speed
Improves QA accuracy in RAG systems
Effectively extracts main and query-relevant content
Abstract
As web agents (e.g., Deep Research) routinely consume massive volumes of web pages to gather and analyze information, LLM context management -- under large token budgets and low signal density -- emerges as a foundational, high-importance, and technically challenging problem for agentic and RAG pipelines. Existing solutions for extracting relevant content are inadequate: generative extraction models suffer from high latency, rule-based heuristics lack adaptability, and chunk-and-rerank methods are blind to webpage structure. To overcome these issues, we introduce Index-based Web Content Extraction to reframe the extraction process from slow, token-by-token generation into a highly efficient, discriminative task of index prediction, achieving both effectiveness and efficiency. We partition HTML into structure-aware, addressable segments, and extract only the positional indices of content…
Taxonomy
Topics: Web Data Mining and Analysis · Information Retrieval and Search Behavior · Advanced Text Analysis Techniques
