"I'm in the Bluesky Tonight": Insights from a Year Worth of Social Data
Andrea Failla, Giulio Rossetti

TL;DR
This paper introduces a large, high-coverage dataset from Bluesky Social covering user-generated content, social interactions, and the outputs of popular feed generators, intended to support research on online social behavior and the spread of misinformation.
Contribution
The authors release a large, high-coverage social media dataset from Bluesky, including posts, interactions, and algorithm outputs, addressing data scarcity issues in computational social science.
Findings
Enables analysis of online behavior and engagement patterns
Provides ground-truth data for studying content virality
Facilitates research on misinformation and content diffusion
Abstract
Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social to address this pressing issue. The dataset contains the complete post history of over 4M users (81% of all registered accounts), totalling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions. Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped…
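To make the interaction types named in the abstract (follow, comment, repost, quote) concrete, here is a minimal sketch of how such records might be tallied. The record layout (source user, target user, interaction type), the sample users, and the helper names `count_by_type` and `in_degree` are all assumptions made for illustration. The dataset's actual file format and field names are not described in this summary.

```python
from collections import Counter, defaultdict

# Hypothetical records: (source_user, target_user, interaction_type).
# The real dataset's file layout and field names are not given in this
# summary, so this tuple shape is an assumption for illustration only.
INTERACTIONS = [
    ("alice", "bob", "follow"),
    ("carol", "bob", "follow"),
    ("bob", "alice", "repost"),
    ("carol", "alice", "quote"),
    ("alice", "carol", "comment"),
]

# The four interaction types listed in the abstract.
VALID_TYPES = {"follow", "comment", "repost", "quote"}


def count_by_type(records):
    """Count how many interactions of each recognised type appear."""
    return Counter(kind for _, _, kind in records if kind in VALID_TYPES)


def in_degree(records, kind):
    """Count, per target user, the incoming interactions of one type."""
    degree = defaultdict(int)
    for _source, target, record_kind in records:
        if record_kind == kind:
            degree[target] += 1
    return dict(degree)


if __name__ == "__main__":
    print(count_by_type(INTERACTIONS))
    print(in_degree(INTERACTIONS, "follow"))
```

Keeping each interaction type as a separate directed edge list makes it easy to compare, for example, who is followed against who is reposted, which is the kind of question the released interaction data is meant to support.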
Taxonomy
Topics: Data Analysis and Archiving · Big Data Technologies and Applications · Data Quality and Management
