Examining the Feasibility of Off-the-Shelf Algorithms for Masking Directly Identifiable Information in Social Media Data
Rachel Dorn, Alicia L. Nobles, Masoud Rouhizadeh, Mark Dredze

TL;DR
This study evaluates the effectiveness of existing off-the-shelf algorithms in identifying and removing directly identifiable information from social media data, specifically tweets, to address privacy concerns.
Contribution
The paper introduces an annotated tweet dataset and a tool called Nightjar to assess the feasibility of using existing algorithms for privacy preservation in social media data.
Findings
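Nightjar, a tool compiled from off-the-shelf algorithms, is evaluated on removing directly identifiable information from tweets.
An English dataset of tweets annotated for identifiable information is released as a benchmark for future research.
The study assesses how feasible off-the-shelf algorithms are for masking directly identifiable information in social media data.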
Abstract
The identification and removal/replacement of protected information from social media data is an understudied problem, despite being desirable from an ethical and legal perspective. This paper identifies types of potentially directly identifiable information (inspired by protected health information in clinical texts) contained in tweets that may be readily removed using off-the-shelf algorithms, introduces an English dataset of tweets annotated for identifiable information, and compiles these off-the-shelf algorithms into a tool (Nightjar) to evaluate the feasibility of using Nightjar to remove directly identifiable information from the tweets. Nightjar as well as the annotated data can be retrieved from https://bitbucket.org/mdredze/nightjar.
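To make the idea of masking directly identifiable information concrete, here is a minimal sketch in Python. It is not Nightjar's implementation: the category labels, the regular expressions, and the function name `mask_identifiers` are all assumptions chosen for illustration, and a real pipeline would likely combine pattern matching with trained models such as named-entity recognizers.

```python
import re

# Hypothetical patterns for illustration only; these are NOT Nightjar's rules.
# Order matters: URLs and emails are masked before @-mentions so that the
# "@" inside an email address is never treated as a username.
PATTERNS = [
    ("URL", re.compile(r"https?://\S+")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
    ("USER", re.compile(r"(?<!\w)@\w{1,15}")),
]


def mask_identifiers(text: str) -> str:
    """Replace each matched span with a bracketed category placeholder."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    tweet = "Thanks @jane_doe! Call 555-123-4567 or mail jd@example.com, see https://example.com/x"
    print(mask_identifiers(tweet))
```

Replacing a span with a category placeholder rather than deleting it keeps the sentence readable for downstream analysis, which is one common design choice for this kind of masking; whether Nightjar removes or replaces each category is not stated in the abstract beyond "removal/replacement".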
Taxonomy
Topics: Misinformation and Its Impacts · Ethics in Clinical Research · Privacy, Security, and Data Protection
