A novel multi-threaded web crawling model
Weijie.Jiang

TL;DR
This paper introduces a multi-threaded web crawling model that enhances large-scale web data collection efficiency by dividing tasks into concurrent threads, significantly outperforming traditional single-threaded methods.
Contribution
The paper presents a novel multi-threaded web crawling framework that improves data acquisition speed and efficiency for large-scale web data collection.
Findings
The model significantly outperforms single-threaded approaches in crawler tests
Concurrent threading improves crawling speed
Effective data buffering and parsing in multi-threaded environment
Abstract
This paper proposes a novel web crawling model suited to large-scale web data acquisition. The model first divides the web data into several subsets, each corresponding to one thread task. In each thread task, web crawling tasks are executed concurrently, and the crawled data are stored in a buffer queue to await parsing. The parsing process is likewise divided across several threads. After building the model and running repeated crawler tests, the author finds that it significantly outperforms single-threaded approaches.
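The pipeline described in the abstract (partition the work, crawl in concurrent threads, buffer the results in a queue, parse in separate threads) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `fetch`, `parse`, `crawl`, and the thread counts are hypothetical names and values, and `fetch` fabricates a page locally instead of issuing a real HTTP request so the sketch runs without network access.

```python
import queue
import threading


def fetch(url):
    # Hypothetical stand-in for an HTTP request: builds a fake page locally.
    return f"<html><title>{url}</title></html>"


def parse(page):
    # Hypothetical parser: extracts the text between <title> and </title>.
    start = page.index("<title>") + len("<title>")
    end = page.index("</title>")
    return page[start:end]


def crawl(urls, n_crawlers=4, n_parsers=2):
    buffer = queue.Queue()          # buffer queue between crawling and parsing
    results = []
    results_lock = threading.Lock()

    # Divide the URL list into n_crawlers subsets, one per crawler thread.
    chunks = [urls[i::n_crawlers] for i in range(n_crawlers)]

    def crawler(chunk):
        for url in chunk:
            buffer.put(fetch(url))  # crawled data waits in the buffer

    def parser():
        while True:
            page = buffer.get()
            if page is None:        # sentinel: no more pages will arrive
                break
            title = parse(page)
            with results_lock:
                results.append(title)

    crawlers = [threading.Thread(target=crawler, args=(c,)) for c in chunks]
    parsers = [threading.Thread(target=parser) for _ in range(n_parsers)]
    for t in crawlers + parsers:
        t.start()
    for t in crawlers:
        t.join()
    for _ in parsers:
        buffer.put(None)            # one sentinel per parser thread
    for t in parsers:
        t.join()
    return results
```

Calling `crawl(["a", "b", "c"])` returns the three parsed titles in an order that depends on thread scheduling, so callers that need a stable order should sort the result. Sentinels are enqueued only after every crawler thread has finished, which guarantees the parser threads drain the buffer before they exit.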
Taxonomy
Topics: Web Data Mining and Analysis · Distributed and Parallel Computing Systems · Advanced Malware Detection Techniques
