A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites

Keyang Xu; Kyle Yingkai Gao; Jamie Callan

arXiv:1804.02734·cs.IR·April 10, 2018

Open Access

TL;DR

This paper introduces SOUrCe, an unsupervised, structure-based crawling method for social media sites that learns site structure to improve crawling efficiency and focus on user-generated content.

Contribution

It presents a novel two-stage approach combining structural clustering and navigation table generation for efficient, focused social media crawling without supervision.

Findings

01. Outperforms baseline methods in staying focused on user content

02. Supports various crawling styles effectively

03. Efficiently constructs site structure for targeted crawling

Abstract

Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn how to crawl a social media site efficiently. SOUrCe consists of two stages. During its unsupervised learning phase, SOUrCe constructs a sitemap that clusters pages based on their structural similarity and generates a navigation table that describes how the different types of pages in the site are linked together. During its harvesting phase, it uses the navigation table and a crawling policy to guide the choice of which links to crawl next. Experiments show that this architecture supports different styles of crawling efficiently, and does a better job of staying focused on user-created content than baseline methods.
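
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which structural features, similarity measure, clustering algorithm, or crawling policy SOUrCe uses, so the Jaccard similarity over tag-path sets, the greedy threshold clustering, the toy policy, and every name below (`cluster_pages`, `build_navigation_table`, `next_clusters`, the sample URLs) are assumptions chosen only to make the idea concrete.

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard similarity between two sets of structural features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster_pages(pages, threshold=0.5):
    """Learning stage, part 1 (assumed method): greedy clustering.
    A page joins the first cluster whose representative (its first
    member) is similar enough; otherwise it starts a new cluster."""
    reps = []      # one representative feature set per cluster
    labels = {}    # url -> cluster id
    for url, feats in pages.items():
        for cid, rep in enumerate(reps):
            if jaccard(feats, rep) >= threshold:
                labels[url] = cid
                break
        else:
            labels[url] = len(reps)
            reps.append(feats)
    return labels

def build_navigation_table(links, labels):
    """Learning stage, part 2: count observed links between clusters,
    so that table[src_cluster][dst_cluster] is a link count."""
    table = defaultdict(lambda: defaultdict(int))
    for src, dst in links:
        if src in labels and dst in labels:
            table[labels[src]][labels[dst]] += 1
    return table

def next_clusters(table, current, targets):
    """Harvesting stage (toy policy): from the current cluster, follow
    only links into target clusters, most frequently observed first."""
    dests = table.get(current, {})
    return sorted((c for c in dests if c in targets), key=lambda c: -dests[c])

# Hypothetical sample site: each page is a set of tag paths.
pages = {
    "/index": {"html/body/ul/li/a"},
    "/thread/1": {"html/body/div.post", "html/body/div.reply"},
    "/thread/2": {"html/body/div.post", "html/body/div.reply", "html/body/div.ad"},
    "/user/7": {"html/body/div.profile"},
}
links = [
    ("/index", "/thread/1"),
    ("/index", "/thread/2"),
    ("/index", "/user/7"),
    ("/thread/1", "/user/7"),
]

labels = cluster_pages(pages)
table = build_navigation_table(links, labels)
thread_cluster = labels["/thread/1"]
print(labels)
print(next_clusters(table, labels["/index"], {thread_cluster}))
```

In this sketch the two thread pages fall into one cluster despite the extra ad block on the second, and a crawler targeting thread pages would follow only index-to-thread links and skip the link to the profile page. A real crawler would need far richer structural features and a policy that also considers clusters that lead to targets indirectly.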

Taxonomy

Topics: Web Data Mining and Analysis · Caching and Content Delivery · Spam and Phishing Detection