The Implementation of Hadoop-based Crawler System and Graphlite-based PageRank-Calculation In Search Engine
Qingpei Guo, Chao Xu, Yang Song

TL;DR
This paper presents a distributed search engine component using Hadoop for crawling and GraphLite for real-time PageRank calculation, addressing scalability and efficiency challenges in web data processing.
Contribution
It introduces a Hadoop-based crawler system combined with GraphLite for real-time PageRank computation, enhancing scalability and efficiency in search engine operations.
Findings
Distributed crawler system improves data collection efficiency.
Real-time PageRank calculation accelerates index updating.
System effectively handles large-scale web data processing.
Abstract
Nowadays, the size of the Internet is experiencing rapid growth. As of December 2014, the number of websites worldwide exceeded 1 billion, and all kinds of information resources are integrated on the Internet, so the search engine has become a necessary tool for users to retrieve useful information from vast amounts of web data. Generally speaking, a complete search engine includes a crawler system, an index-building system, a sorting system, and a retrieval system. At present there are many open-source search engine implementations, such as Lucene, Solr, Katta, Elasticsearch, Solandra, and so on. The crawler system and the sorting system are indispensable for any kind of search engine. To guarantee efficiency, the former needs to keep vast amounts of crawled data up to date, and the latter needs to build the index on newly crawled web pages in real time and calculate…
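For readers unfamiliar with the ranking step named in the title, the following is a minimal sketch of PageRank by power iteration on a toy link graph. It is not the authors' GraphLite implementation and does not use GraphLite's API. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the `toy_graph` data are all assumptions chosen for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    Illustrative only: parameter defaults are assumptions, not values
    taken from the paper.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not out_links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}

        # Each page passes an equal share of its damped rank to every target.
        for page, targets in out_links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph: "c" is linked to by every other page.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In a vertex-centric graph engine, the inner loop would typically be expressed as messages sent along out-edges in each superstep. The single-machine loop above only shows the arithmetic being distributed.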
Taxonomy
Topics: Web Data Mining and Analysis · Cloud Computing and Resource Management · Graph Theory and Algorithms
