AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

Ren Ma; Jiantao Qiu; Chao Xu; Pei Chu; Kaiwen Liu; Pengli Ren; Yuan Qu; Jiahui Peng; Linfeng Hou; Mengjie Liu; Lindong Lu; Wenchang Ning; Jia Yu; Rui Min; Jin Shi; Haojiong Chen; Peng Zhang; Wenjian Zhang; Qian Jiang; Zengjie Hu; Guoqiang Yang; Zhenxiang Li; Fukai Shang; Runyuan Ma; Chenlin Su; Zhongying Tu; Wentao Zhang; Dahua Lin; Conghui He

arXiv:2511.16397·cs.CL·November 27, 2025

TL;DR

This paper introduces MinerU-HTML, a model-based HTML content extractor that significantly improves structure preservation in web data, leading to a large, high-quality multilingual corpus that enhances language model performance.

Contribution

The paper presents MinerU-HTML, a scalable, model-based HTML extraction pipeline that outperforms heuristic methods and enables the construction of a large, high-quality AI-ready web corpus.

Findings

1. MinerU-HTML achieves 81.8% ROUGE-N F1 on MainWebBench, outperforming Trafilatura.

2. Models trained on AICC outperform those trained on TfCC and other web corpora.

3. Extraction quality directly impacts downstream language model performance.
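
The ROUGE-N F1 figure in the first finding is an n-gram overlap score between extracted text and a reference extraction. The sketch below shows how such a score is commonly computed. The whitespace tokenization, the choice of `n`, and the function names are assumptions made for illustration; the page does not say how MainWebBench tokenizes text or which `n` it uses.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """ROUGE-N F1 between two strings, using whitespace tokens (an assumption)."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

An extractor that keeps navigation text or other boilerplate lowers precision, and one that drops part of the main content lowers recall, so the F1 penalizes both kinds of error.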

Abstract

While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication, treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, code, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its…
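
To make the sequence-labeling framing concrete, the sketch below treats a page as an ordered list of text blocks, assigns each block a label, and keeps only those labeled as main content. Everything here is hypothetical: the label names, `extract_main`, and `toy_labeler` are illustrative stand-ins, not the MinerU-HTML interface. In particular, the toy labeler is a trivial length rule used only so the example runs. In the paper's pipeline, the labels come from the language model.

```python
from typing import Callable, List

# Hypothetical label set; the page does not list the actual labels.
MAIN, OTHER = "main", "other"


def extract_main(
    blocks: List[str],
    label_blocks: Callable[[List[str]], List[str]],
) -> List[str]:
    """Keep the blocks that the labeler marks as main content, in order."""
    labels = label_blocks(blocks)
    if len(labels) != len(blocks):
        raise ValueError("labeler must return exactly one label per block")
    return [block for block, label in zip(blocks, labels) if label == MAIN]


def toy_labeler(blocks: List[str]) -> List[str]:
    """Stand-in for a learned model: label blocks of five or more words as main."""
    return [MAIN if len(block.split()) >= 5 else OTHER for block in blocks]
```

Passing the labeler in as a function keeps the filtering step independent of how labels are produced, so a heuristic and a model-based labeler can be compared on the same blocks.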

Code & Models

Models

opendatalab/MinerU-HTML-v1.1-hunyuan0.5B-compact (model, 95 downloads, 2 likes)

Datasets

opendatalab/AICC (dataset, 15k downloads)

Taxonomy

Topics: Web Data Mining and Analysis · Natural Language Processing Techniques · Web Application Security Vulnerabilities