A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites
Keyang Xu, Kyle Yingkai Gao, Jamie Callan

TL;DR
This paper introduces SOUrCe, an unsupervised, structure-based crawling method for social media sites that learns site structure to improve crawling efficiency and focus on user-generated content.
Contribution
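It presents a two-stage approach to focused social media crawling that needs no supervision: a learning phase that clusters pages by structural similarity and generates a navigation table, followed by a harvesting phase that uses the table and a crawling policy to choose which links to crawl next.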
Findings
Outperforms baseline methods in staying focused on user content
Supports various crawling styles effectively
Efficiently constructs site structure for targeted crawling
Abstract
Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn how to crawl a social media site efficiently. SOUrCe consists of two stages. During its unsupervised learning phase, SOUrCe constructs a sitemap that clusters pages based on their structural similarity and generates a navigation table that describes how the different types of pages in the site are linked together. During its harvesting phase, it uses the navigation table and a crawling policy to guide the choice of which links to crawl next. Experiments show that this architecture supports different styles of crawling efficiently, and does a better job of staying focused on user-created content than baseline methods.
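The two phases described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the structural signature (a set of root-to-node tag paths), the Jaccard similarity, the greedy clustering, the 0.7 threshold, and all function names below are assumptions chosen only to make the idea concrete. The last function is one possible crawling policy, which keeps only the page clusters from which a target cluster can be reached.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths seen in an HTML page."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed children).
            while self.stack and self.stack.pop() != tag:
                pass


def structure_signature(html):
    """Assumed structural signature: the set of tag paths in the page."""
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_pages(pages, threshold=0.7):
    """Greedy clustering: a page joins the first cluster whose
    representative signature is similar enough, else starts a new one."""
    representatives = []
    assignment = {}
    for url, html in pages.items():
        sig = structure_signature(html)
        for cid, rep in enumerate(representatives):
            if jaccard(sig, rep) >= threshold:
                assignment[url] = cid
                break
        else:
            representatives.append(sig)
            assignment[url] = len(representatives) - 1
    return assignment


def build_navigation_table(links, assignment):
    """Counts links between clusters: table[src_cluster][dst_cluster]."""
    table = defaultdict(lambda: defaultdict(int))
    for src, dst in links:
        if src in assignment and dst in assignment:
            table[assignment[src]][assignment[dst]] += 1
    return table


def clusters_leading_to(table, target):
    """Example policy: clusters from which the target cluster is reachable."""
    keep = {target}
    changed = True
    while changed:
        changed = False
        for src, dsts in table.items():
            if src not in keep and any(d in keep for d in dsts):
                keep.add(src)
                changed = True
    return keep
```

In this sketch, a harvesting loop would follow a link only when the cluster it is expected to lead to is in the set returned by `clusters_leading_to`. How SOUrCe itself predicts a link's destination type before fetching the page is not stated in the abstract, so that step is left out here.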
Taxonomy
Topics: Web Data Mining and Analysis · Caching and Content Delivery · Spam and Phishing Detection
