Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM
Mengjie Liu, Jiahui Peng, Wenchang Ning, Pei Chu, Jiantao Qiu, Ren Ma, He Zhu, Rui Min, Lindong Lu, Linfeng Hou, Kaiwen Liu, Yuan Qu, Zhenxiang Li, Chao Xu, Zhongying Tu, Wentao Zhang, Conghui He

TL;DR
Dripper introduces a lightweight, efficient framework for main HTML content extraction using small language models, outperforming heuristics and rivaling large models, with a new benchmark and open-source tools.
Contribution
The paper reformulates extraction as a constrained sequence labeling task with small language models (SLMs), yielding an efficient and accurate method that surpasses traditional heuristics and rivals large models.
Findings
Dripper achieves a throughput of 3.08 pages/sec on a single A100 GPU.
The Dripper-0.6B model outperforms heuristics and rivals large models.
Pre-training on Dripper-curated data improves downstream task performance.
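To put the reported throughput in perspective, the short calculation below converts 3.08 pages per second into pages per day and into GPU-hours for a corpus of a given size. It is a back-of-the-envelope sketch: the function names and the one-billion-page corpus are illustrative assumptions, and it assumes the reported rate is sustained without interruption, which real pipelines rarely achieve.

```python
# Back-of-the-envelope throughput arithmetic.
# Only PAGES_PER_SEC comes from the source; everything else is illustrative.

PAGES_PER_SEC = 3.08          # reported throughput on a single A100 GPU
SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_day(rate: float = PAGES_PER_SEC) -> float:
    """Pages processed in 24 hours at a constant rate (idealized)."""
    return rate * SECONDS_PER_DAY


def gpu_hours(num_pages: float, rate: float = PAGES_PER_SEC) -> float:
    """GPU-hours needed to process num_pages at a constant rate (idealized)."""
    return num_pages / rate / 3600


if __name__ == "__main__":
    print(f"pages per GPU-day: {pages_per_day():,.0f}")
    # Hypothetical corpus size, chosen only for illustration.
    print(f"GPU-hours for 1e9 pages: {gpu_hours(1e9):,.0f}")
```

At the reported rate, one GPU handles roughly 266,000 pages per day, so web-scale corpora still require many GPUs running in parallel.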
Abstract
High-quality main content extraction from web pages is a critical prerequisite for constructing large-scale training corpora. While traditional heuristic extractors are efficient, they lack the semantic reasoning required to handle the structural heterogeneity of the modern web. Conversely, well-pretrained generative Large Language Models (LLMs) offer superior document comprehension but are prohibited by excessive computational costs, limited context windows, and hallucination risks when applied at web scale. We present Dripper, a lightweight framework that resolves these bottlenecks through four contributions: (1) We reformulate extraction as a constrained sequence labeling task using SLMs (Small Language Models). This paradigm eliminates generative hallucinations and achieves exceptional efficiency, reaching a throughput of 3.08 pages per second on a single A100 GPU.…
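The idea of constrained sequence labeling can be illustrated with a minimal sketch. This is not the authors' implementation: the label set (`main`, `other`), the function names, and the toy scoring function standing in for a small language model are all hypothetical. The sketch shows only the core property described in the abstract: the decoder may emit nothing but a label per block, and the output is assembled from the original blocks, so no new text can be generated.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical label set; the paper's actual labels may differ.
LABELS: Sequence[str] = ("main", "other")


def constrained_label(scores: Dict[str, float],
                      allowed: Sequence[str] = LABELS) -> str:
    """Return the best-scoring token among the allowed labels only.

    Tokens outside `allowed` are ignored, however high their score.
    """
    return max(allowed, key=lambda tok: scores.get(tok, float("-inf")))


def extract_main(blocks: List[str],
                 score_fn: Callable[[str], Dict[str, float]]) -> List[str]:
    """Label every block, then keep the original blocks labeled 'main'."""
    labels = [constrained_label(score_fn(block)) for block in blocks]
    return [block for block, label in zip(blocks, labels) if label == "main"]


def toy_score_fn(block: str) -> Dict[str, float]:
    """Stand-in for a small LM's next-token scores (purely illustrative)."""
    is_chrome = block.startswith(("<nav", "<footer"))
    return {
        "main": 0.2 if is_chrome else 1.5,
        "other": 1.0,
        # A free-text token an unconstrained decoder might prefer.
        "Sure, here is the content": 2.0,
    }


blocks = [
    "<nav>Home | About</nav>",
    "<p>Dripper labels blocks instead of rewriting them.</p>",
    "<footer>Copyright</footer>",
]
main_blocks = extract_main(blocks, toy_score_fn)
print(main_blocks)
```

Because the result is a subset of the input blocks, the extracted content is copied verbatim from the page, and the model only needs to produce one short label per block rather than regenerate the text.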
Peer Reviews
Decision · Submitted to ICLR 2026
- Efficient approach to improve context extraction from HTML pages.
- The approach is generally well motivated.
- Paper introduces a more comprehensive benchmark for evaluating HTML extraction.
- The described constrained decoding mechanism is framed as a contribution; however, constrained decoding via a grammar is a built-in feature in major inference frameworks, so this aspect of the pipeline, while sensible, is fairly straightforward.
- Further, Section 5.4 shows that the mechanism has limited impact once enough supervision is provided.
- The proposed benchmark comprises 90% randomly sampled websites and 10% head-distribution websites. It would have been useful to get more statistics abo
Very solid submission and I’m appreciative of work like this that goes the full effort to build something usable in the real world (not just hill climbing on others’ benchmarks, but curating a good benchmark themselves; not just training a model but designing a system that can use the model, with considerations of deployment efficiency). The results speak for themselves; impressive improvement over past work on an important but under-appreciated problem. This paper should be accepted. I’ve seen
I think this paper should be accepted. That being said, here are some aspects that bothered me a bit while reading the submission. I would really appreciate if the authors can revise accordingly: First, I think it’s important for the authors to provide a discussion of relation to prior work in block sequence classification. For clean content extraction of structured, layout-rich documents, this has been done before. For example, “VILA: Improving Structured Content Extraction from Scientific PDF
- The pipeline (HTML simplification + block classification + controlled decoding) directly addresses efficiency and hallucination issues.
- Comprehensive experiments, including 12 baseline methods and 2 datasets, support Dripper's performance claims, and public code, models, and datasets ensure reproducibility.
- MainWebBench offers community value for model evaluation: it is 7x larger than existing datasets and has complete annotations.
- The 0.6B-parameter model suits large-scale processing, meeting r
- The presentation of figures, equations, and tables in this paper has inconsistencies affecting clarity. For example, the color scheme of Fig. 2 is confusing and lacks a corresponding description. There is a formatting mismatch between the description of Eq. (1) and the equation itself. Additionally, the captions for tables are overly simplistic, making it difficult to understand the tables.
- Lack of analysis on multilingual (e.g., Chinese) and domain generalization; MainWebBench has no domain
Taxonomy
Topics: Web Data Mining and Analysis · Text Readability and Simplification · Handwritten Text Recognition Techniques
