Do we really need to catch them all? A new User-guided Social Media Crawling method
Fredrik Erlandsson, Piotr Bródka, Martin Boldt, and Henric Johnson

TL;DR
This paper introduces a user-guided social media crawling method that efficiently collects a large portion of interactions by sampling a small subset of posts, reducing time and resources needed for comprehensive data collection.
Contribution
The novel USMC method leverages crowd wisdom to prioritize content collection, achieving high coverage with significantly less crawling effort.
Findings
Covers approximately 75% of interactions with only 20% of posts
Reduces crawling time by 53%
Maintains similar network structure with less data
Abstract
With the growing use of popular social media services like Facebook and Twitter, it is challenging to collect all content from the networks without access to the core infrastructure or paying for it. Thus, if all content cannot be collected, one must consider which data are of most importance. In this work we present a novel User-guided Social Media Crawling method (USMC) that collects data from social media, utilizing the wisdom of the crowd to decide the order in which user-generated content should be collected to cover as many user interactions as possible. USMC is validated by crawling 160 public Facebook pages, containing content from 368 million users including 1.3 billion interactions, and it is compared with two other crawling methods. The results show that it is possible to cover approximately 75% of the interactions on a Facebook page by sampling just 20% of its posts,…
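The prioritization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the excerpt does not say which crowd signal USMC uses, so the `signal` field (a cheaply observable engagement hint such as a like count), the `Post` type, and both function names are hypothetical, and the toy numbers are made up for illustration rather than taken from the paper. The sketch ranks posts by the signal, crawls only a fixed fraction of them, and reports what share of all interactions that subset covers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Post:
    post_id: str
    signal: int        # hypothetical cheap engagement hint (assumption, not from the paper)
    interactions: int  # interactions that a full crawl of this post would yield


def prioritized_order(posts: List[Post]) -> List[Post]:
    """Order posts so that those with the strongest crowd signal come first."""
    return sorted(posts, key=lambda p: p.signal, reverse=True)


def coverage(posts: List[Post], fraction: float) -> float:
    """Share of all interactions covered when crawling only the top `fraction` of posts."""
    total = sum(p.interactions for p in posts)
    if total == 0:
        return 0.0
    budget = max(1, round(len(posts) * fraction))  # number of posts we can afford to crawl
    crawled = prioritized_order(posts)[:budget]
    return sum(p.interactions for p in crawled) / total


# Toy page with a skewed interaction distribution (made-up numbers).
toy_posts = [
    Post("p3", signal=5, interactions=50),
    Post("p1", signal=60, interactions=600),
    Post("p7", signal=3, interactions=30),
    Post("p2", signal=20, interactions=200),
    Post("p4", signal=5, interactions=40),
    Post("p5", signal=4, interactions=30),
    Post("p6", signal=3, interactions=20),
    Post("p8", signal=2, interactions=10),
    Post("p9", signal=2, interactions=10),
    Post("p10", signal=1, interactions=10),
]

if __name__ == "__main__":
    print(f"coverage at 20% of posts: {coverage(toy_posts, 0.2):.2f}")
```

How much a small sample covers depends on how skewed the interactions are across posts and on how well the chosen signal tracks them. With a uniform distribution, or a signal unrelated to interactions, crawling 20% of posts would cover roughly 20% of interactions.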
