A novel multi-threaded web crawling model

Weijie.Jiang

arXiv:2407.10440·cs.DB·July 16, 2024

Open Access

TL;DR

This paper introduces a multi-threaded web crawling model that enhances large-scale web data collection efficiency by dividing tasks into concurrent threads, significantly outperforming traditional single-threaded methods.

Contribution

The paper presents a novel multi-threaded web crawling framework that improves data acquisition speed and efficiency for large-scale web data collection.

Findings

01. The model performs significantly better than single-threaded approaches

02. Concurrent threading improves crawling speed

03. Crawled data is buffered and parsed effectively in a multi-threaded environment

Abstract

This paper proposes a novel web crawling model suited to large-scale web data acquisition. The model first divides the web data to be crawled into several subsets, each corresponding to one thread task. The thread tasks carry out their crawling concurrently, and the crawled data is stored in a buffer queue to await parsing. The parsing process is likewise divided across several threads. Repeated crawler tests with the model show that it performs significantly better than single-threaded approaches.
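
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the author's implementation: the names `partition` and `run_pipeline`, the round-robin split, the thread counts, and the caller-supplied `fetch` and `parse` functions are all assumptions made for this example. It uses only Python's standard `threading` and `queue` modules.

```python
import queue
import threading
from typing import Callable, Dict, List

_SENTINEL = None  # placed on the buffer to tell one parser thread to stop


def partition(urls: List[str], n: int) -> List[List[str]]:
    """Split urls into n round-robin subsets, one per crawl thread (assumed scheme)."""
    return [urls[i::n] for i in range(n)]


def run_pipeline(
    urls: List[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], str],
    n_crawlers: int = 4,
    n_parsers: int = 2,
) -> Dict[str, str]:
    """Crawl url subsets concurrently, buffer raw pages, parse them in parallel."""
    buffer: "queue.Queue" = queue.Queue()  # thread-safe buffer between stages
    results: Dict[str, str] = {}
    lock = threading.Lock()  # guards writes to the shared results dict

    def crawl(subset: List[str]) -> None:
        # Each crawl thread handles one subset and pushes raw data to the buffer.
        for url in subset:
            buffer.put((url, fetch(url)))

    def parse_worker() -> None:
        # Each parser thread drains the buffer until it sees the sentinel.
        while True:
            item = buffer.get()
            if item is _SENTINEL:
                break
            url, raw = item
            parsed = parse(raw)
            with lock:
                results[url] = parsed

    parsers = [threading.Thread(target=parse_worker) for _ in range(n_parsers)]
    crawlers = [
        threading.Thread(target=crawl, args=(subset,))
        for subset in partition(urls, n_crawlers)
    ]
    for t in parsers + crawlers:
        t.start()
    for t in crawlers:
        t.join()
    # All crawling is done: send one stop signal per parser thread.
    for _ in parsers:
        buffer.put(_SENTINEL)
    for t in parsers:
        t.join()
    return results
```

In a real crawler, `fetch` might wrap an HTTP request (for example via `urllib.request.urlopen`) and `parse` might extract links or text from the HTML. Keeping both as parameters lets the threading structure be exercised without network access.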

Taxonomy

Topics: Web Data Mining and Analysis · Distributed and Parallel Computing Systems · Advanced Malware Detection Techniques