Social Media Data Toolkit: Standardization and Anonymization of Social Network Datasets

Ali Najafi; Letizia Iannucci; Mikko Kivelä; Onur Varol

arXiv:2604.27710 · cs.SI · May 1, 2026

TL;DR

The paper introduces a Python toolkit for standardizing, anonymizing, and enriching social media datasets to facilitate cross-platform analysis and research reproducibility.

Contribution

It presents a comprehensive framework that unifies diverse social media data structures and integrates anonymization and enrichment modules for multi-platform research.

Findings

01. Unifies heterogeneous social media datasets into a common schema.
02. Includes configurable anonymization to protect PII.
03. Supports enrichment with LLMs and network analysis for downstream tasks.
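The first two findings can be illustrated with a minimal sketch. It is not the toolkit's API: the `UnifiedPost` schema, the two platform record layouts, the function names, and the choice of keyed hashing (HMAC-SHA256) for pseudonymization are all assumptions made here for illustration. The sketch maps two differently shaped records onto one schema, then replaces author identifiers with keyed pseudonyms and optionally masks e-mail addresses in the text.

```python
import hashlib
import hmac
import re
from dataclasses import dataclass, replace


# Hypothetical common schema; field names are illustrative only and are
# not taken from the toolkit.
@dataclass(frozen=True)
class UnifiedPost:
    platform: str
    post_id: str
    author_id: str
    text: str
    created_at: str  # ISO 8601 timestamp, kept as a string for simplicity


def from_platform_a(record: dict) -> UnifiedPost:
    """Map an invented 'platform A' record layout onto the common schema."""
    return UnifiedPost(
        platform="platform_a",
        post_id=str(record["id"]),
        author_id=str(record["user"]["id"]),
        text=record["full_text"],
        created_at=record["created_at"],
    )


def from_platform_b(record: dict) -> UnifiedPost:
    """Map an invented 'platform B' record layout onto the common schema."""
    return UnifiedPost(
        platform="platform_b",
        post_id=str(record["uri"]),
        author_id=str(record["author"]),
        text=record["body"],
        created_at=record["timestamp"],
    )


# Deliberately simple e-mail pattern; real PII detection needs more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def pseudonymize_id(platform: str, value: str, key: bytes) -> str:
    """Keyed hash of a platform-scoped identifier, truncated to 16 hex chars."""
    message = f"{platform}:{value}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()[:16]


def anonymize(post: UnifiedPost, key: bytes, mask_emails: bool = True) -> UnifiedPost:
    """Return a copy with a pseudonymous author ID and, optionally, masked e-mails."""
    text = EMAIL_RE.sub("[EMAIL]", post.text) if mask_emails else post.text
    author = pseudonymize_id(post.platform, post.author_id, key)
    return replace(post, author_id=author, text=text)


raw_a = {
    "id": 1,
    "user": {"id": 42},
    "full_text": "mail me: jane@example.org",
    "created_at": "2024-01-01T00:00:00Z",
}
raw_b = {
    "uri": "post/9",
    "author": "42",
    "body": "hello",
    "timestamp": "2024-01-02T00:00:00Z",
}

key = b"demo-key-change-me"  # placeholder; a real key must be kept secret
posts = [anonymize(p, key) for p in (from_platform_a(raw_a), from_platform_b(raw_b))]
```

Scoping the hash input by platform keeps unrelated accounts that happen to share a numeric ID from receiving the same pseudonym, while the same account on the same platform always maps to the same value, so network structure survives anonymization.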

Abstract

The rapid diversification of social media platforms and the increasing restrictions on official APIs have significantly complicated cross-platform analysis. Researchers are often forced to rely on heterogeneous datasets obtained through web scraping and historical archives; however, these datasets often lack structural consistency. Prior to conducting cross-platform social media analyses, one needs to answer three critical questions: (1) What makes platforms different and similar? (2) How were the datasets collected? (3) How can we align the datasets of different platforms to conduct fair analyses? To address these questions, we introduce the Social Media Data Toolkit (SMDT), a comprehensive Python framework designed for the standardization, anonymization, and enrichment of social network datasets. SMDT unifies diverse data structures into a generic schema comprising…

Code & Models

Repositories

ViralLab/SMDT (GitHub)