Tree-based Focused Web Crawling with Reinforcement Learning
Andreas Kontogiannis, Dimitrios Kelesis, Vasilis Pollatos, George Giannakopoulos, Georgios Paliouras

TL;DR
This paper introduces TRES, a reinforcement learning framework for focused web crawling. It models crawling as an MDP and uses a novel tree-based sampling algorithm to discover relevant pages and domains efficiently.
Contribution
The paper presents TRES, an RL-based focused crawler that formulates crawling as an MDP and pairs it with Tree-Frontier, a sampling algorithm that avoids evaluating every candidate URL at each crawling step.
Findings
TRES outperforms state-of-the-art methods in harvest rate and relevant domain retrieval.
It significantly reduces the number of URLs evaluated per crawling step.
Its tree-based sampling is provably efficient, which keeps action selection tractable in large state and action spaces.
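The two metrics named in these findings can be illustrated with a minimal sketch. It assumes the common definition of harvest rate as the fraction of fetched pages judged relevant, and counts a domain as relevant when at least one of its pages is. The function names and the `crawl_log` data are hypothetical and do not come from the paper.

```python
from urllib.parse import urlparse


def harvest_rate(relevance_flags):
    """Fraction of fetched pages judged relevant (0.0 for an empty crawl)."""
    flags = list(relevance_flags)
    return sum(flags) / len(flags) if flags else 0.0


def relevant_domains(fetched):
    """Host names that served at least one relevant page."""
    return {urlparse(url).netloc for url, is_relevant in fetched if is_relevant}


# Hypothetical crawl log: (URL, relevance judgment) pairs.
crawl_log = [
    ("https://example.org/a", True),
    ("https://example.org/b", False),
    ("https://example.com/c", True),
    ("https://example.net/d", False),
]

rate = harvest_rate(flag for _, flag in crawl_log)
domains = relevant_domains(crawl_log)
```

In practice the relevance judgment would come from a topic classifier; here it is hard-coded to keep the sketch self-contained.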
Abstract
A focused crawler aims at discovering as many web pages and web sites relevant to a target topic as possible, while avoiding irrelevant ones. Reinforcement Learning (RL) has been a promising direction for optimizing focused crawling, because RL can naturally optimize the long-term profit of discovering relevant web locations within the context of a reward. In this paper, we propose TRES, a novel RL-empowered framework for focused crawling that aims at maximizing both the number of relevant web pages (aka harvest rate) and the number of relevant web sites (domains). We model the focused crawling problem as a novel Markov Decision Process (MDP), which the RL agent aims to solve by determining an optimal crawling strategy. To overcome the computational infeasibility of exhaustively searching for the best action at each time step, we propose Tree-Frontier, a provably…
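The motivation for avoiding an exhaustive search over the frontier can be shown with a toy crawl loop. This is a rough sketch under stated assumptions, not the paper's algorithm: uniform random sampling stands in for Tree-Frontier, whose details are not given here, and the `score` function is a hypothetical placeholder for a learned action value. The link graph and all names are invented for illustration.

```python
import random


def crawl(seed_urls, link_graph, score, steps, sample_size, rng):
    """Toy focused crawl that scores only a sample of the frontier per step."""
    frontier = list(seed_urls)
    seen = set(seed_urls)
    visited = []
    evaluations = 0  # number of URLs scored across all steps
    for _ in range(steps):
        if not frontier:
            break
        # Assumption: a uniform sample replaces the paper's tree-based sampling.
        candidates = rng.sample(frontier, min(sample_size, len(frontier)))
        evaluations += len(candidates)
        best = max(candidates, key=score)
        frontier.remove(best)
        visited.append(best)
        # Expand the fetched page: add unseen outlinks to the frontier.
        for url in link_graph.get(best, []):
            if url not in seen:
                seen.add(url)
                frontier.append(url)
    return visited, evaluations


# Hypothetical link graph and scoring function.
link_graph = {
    "seed": ["topic/1", "other/1", "other/2"],
    "topic/1": ["topic/2", "other/3"],
}


def score(url):
    return 1.0 if url.startswith("topic") else 0.0


visited, evaluations = crawl(
    ["seed"], link_graph, score, steps=3, sample_size=2, rng=random.Random(0)
)
```

With `sample_size` fixed, the per-step scoring cost stays bounded even as the frontier grows, which is the trade-off the abstract alludes to; how well a sample represents the frontier is exactly what a smarter sampling scheme would need to address.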
Taxonomy
Topics: Web Data Mining and Analysis · Optimization and Search Problems · Software Engineering Research
