PETLP: A Privacy-by-Design Pipeline for Social Media Data in AI Research
Nick Oh, Giorgos D. Vrakas, Siân J. M. Brooke, Sasha Morinière, Toju Duke

TL;DR
PETLP is a compliance framework that builds legal safeguards into social media data processing pipelines for AI research, covering the GDPR, copyright law, and platform terms.
Contribution
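The paper introduces PETLP (Privacy-by-design Extract, Transform, Load, and Present), a pipeline that embeds legal safeguards directly into extended ETL and treats Data Protection Impact Assessments as living documents that evolve from pre-registration through dissemination.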
Findings
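Analysis of Reddit shows that extraction rights differ by entity type: qualifying research organisations can invoke DSM Article 3 to override platform restrictions, commercial entities remain bound by terms of service, and GDPR obligations apply to both.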
True anonymisation of social media data is unachievable.
Legal gaps exist between dataset creation and model distribution.
Abstract
Social media data presents AI researchers with overlapping obligations under the GDPR, copyright law, and platform terms -- yet existing frameworks fail to integrate these regulatory domains, leaving researchers without unified guidance. We introduce PETLP (Privacy-by-design Extract, Transform, Load, and Present), a compliance framework that embeds legal safeguards directly into extended ETL pipelines. Central to PETLP is treating Data Protection Impact Assessments as living documents that evolve from pre-registration through dissemination. Through systematic Reddit analysis, we demonstrate how extraction rights fundamentally differ between qualifying research organisations (who can invoke DSM Article 3 to override platform restrictions) and commercial entities (bound by terms of service), whilst GDPR obligations apply universally. We demonstrate why true anonymisation remains…
