Web Page Classification using LLMs for Crawling Support
Yuichi Sasazawa, Yasuhiro Sogawa

TL;DR
This paper introduces a method using large language models to classify web pages into index and content types, improving web crawling efficiency by better selecting starting points for discovering new pages.
Contribution
The study proposes using an LLM to classify web pages in order to guide crawling, constructs a dataset whose page types are annotated automatically, and evaluates both page type classification performance and coverage of new pages.
Findings
The LLM-based method outperforms baseline methods at page type classification.
Using index pages as starting points improves coverage of new pages during crawling.
An automatically annotated dataset of page types is used effectively for the evaluation.
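To make the coverage finding concrete, here is a minimal sketch of one way such a score could be computed. It assumes coverage means the fraction of known new pages that are linked directly from the chosen starting pages; this summary does not give the paper's exact definition, and `new_page_coverage`, `link_graph`, and the example URLs are hypothetical names used only for illustration.

```python
from typing import Dict, Iterable, List, Set


def new_page_coverage(
    start_pages: Iterable[str],
    link_graph: Dict[str, List[str]],
    new_pages: Set[str],
) -> float:
    """Fraction of new pages linked directly from the starting pages.

    Assumed definition for illustration, not taken from the paper.
    """
    if not new_pages:
        return 0.0
    reached: Set[str] = set()
    for url in start_pages:
        # Collect every outgoing link of each starting page (one hop only).
        reached.update(link_graph.get(url, []))
    return len(reached & new_pages) / len(new_pages)


# Toy site: one index page and two non-index pages.
link_graph = {
    "/news/": ["/news/1", "/news/2"],
    "/news/1": ["/news/"],
    "/about": ["/news/1"],
}
new_pages = {"/news/1", "/news/2", "/blog/9"}

from_index = new_page_coverage(["/news/"], link_graph, new_pages)
from_content = new_page_coverage(["/news/1"], link_graph, new_pages)
print(from_index, from_content)
```

In this toy graph, starting from the index page reaches two of the three new pages, while starting from a content page reaches none, which is the intuition behind preferring index pages as starting points.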
Abstract
A web crawler is a system designed to collect web pages, and efficient crawling of new pages requires appropriate algorithms. While website features such as XML sitemaps and the frequency of past page updates provide important clues for accessing new pages, their universal application across diverse conditions is challenging. In this study, we propose a method to efficiently collect new pages by classifying web pages into two types, "Index Pages" and "Content Pages," using a large language model (LLM), and leveraging the classification results to select index pages as starting points for accessing new pages. We construct a dataset with automatically annotated web page types and evaluate our approach from two perspectives: the page type classification performance and coverage of new pages. Experimental results demonstrate that the LLM-based method outperformed baseline methods in both…
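The approach described in the abstract can be sketched roughly as follows: ask a model whether each page is an index page or a content page, then keep the index pages as crawl starting points. This is a sketch under stated assumptions, not the authors' implementation. The prompt wording, the page fields (`url`, `title`, `links`), and all function names are hypothetical, and `stub_llm` is a simple link-count rule standing in for a real LLM call so the example runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical prompt; the paper's actual prompt is not given in this summary.
PROMPT_TEMPLATE = (
    "Classify the web page as 'index' (mainly links to other pages) "
    "or 'content' (mainly a single article or item).\n"
    "Title: {title}\n"
    "Number of links: {n_links}\n"
    "Answer with one word."
)


def build_prompt(page: Dict) -> str:
    """Fill the prompt with a few simple page features."""
    return PROMPT_TEMPLATE.format(title=page["title"], n_links=len(page["links"]))


def classify_page(page: Dict, llm: Callable[[str], str]) -> str:
    """Return 'index' or 'content' based on the model's answer."""
    answer = llm(build_prompt(page)).strip().lower()
    # Anything that does not clearly say "index" is treated as content.
    return "index" if answer.startswith("index") else "content"


def select_start_pages(pages: List[Dict], llm: Callable[[str], str]) -> List[str]:
    """Keep only the pages classified as index pages as crawl starting points."""
    return [p["url"] for p in pages if classify_page(p, llm) == "index"]


def stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM: answers 'index' when the prompt reports 10+ links."""
    n_links = int(prompt.split("Number of links: ")[1].split("\n")[0])
    return "index" if n_links >= 10 else "content"


pages = [
    {
        "url": "https://example.com/news/",
        "title": "Latest news",
        "links": [f"https://example.com/news/{i}" for i in range(12)],
    },
    {
        "url": "https://example.com/news/3",
        "title": "An article",
        "links": ["https://example.com/news/"],
    },
]

print(select_start_pages(pages, stub_llm))  # ['https://example.com/news/']
```

Passing the model in as a plain callable keeps the selection logic independent of any particular LLM provider; a real client would replace `stub_llm` with a function that sends the prompt and returns the model's text.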
Taxonomy
Topics: Web Data Mining and Analysis · Text and Document Classification Technologies · Web visibility and informetrics
