Examining the Feasibility of Off-the-Shelf Algorithms for Masking Directly Identifiable Information in Social Media Data

Rachel Dorn; Alicia L. Nobles; Masoud Rouhizadeh; Mark Dredze

arXiv:2011.08324 · cs.HC · November 18, 2020 · 1 citation

TL;DR

This study evaluates how well off-the-shelf algorithms identify and remove directly identifiable information from social media data, specifically tweets, to address privacy concerns.

Contribution

The paper introduces an annotated tweet dataset and a tool called Nightjar to assess the feasibility of using existing algorithms for privacy preservation in social media data.

Findings

1. Nightjar successfully identifies identifiable information in tweets
2. Annotated dataset provides a benchmark for future research
3. Feasibility of off-the-shelf algorithms is demonstrated

Abstract

The identification and removal/replacement of protected information from social media data is an understudied problem, despite being desirable from an ethical and legal perspective. This paper identifies types of potentially directly identifiable information (inspired by protected health information in clinical texts) contained in tweets that may be readily removed using off-the-shelf algorithms, introduces an English dataset of tweets annotated for identifiable information, and compiles these off-the-shelf algorithms into a tool (Nightjar) to evaluate the feasibility of using Nightjar to remove directly identifiable information from the tweets. Nightjar as well as the annotated data can be retrieved from https://bitbucket.org/mdredze/nightjar.
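
To make the idea of masking directly identifiable information concrete, here is a minimal sketch in Python. It is not Nightjar's implementation: the function name `mask_tweet`, the four categories (URLs, email addresses, user mentions, US-style phone numbers), and the regular expressions are all assumptions chosen for illustration, and the abstract does not say which algorithms or categories Nightjar actually uses.

```python
import re

# Hypothetical patterns for illustration only; these are NOT taken from Nightjar.
# Order matters: URLs and emails are masked before bare @mentions so that the
# "@" inside an email address is not mistaken for a user handle.
PATTERNS = [
    ("URL", re.compile(r"https?://\S+")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("USER", re.compile(r"(?<!\w)@\w{1,15}")),
    # Assumed US-style phone format, e.g. 410-555-0123 or (410) 555 0123.
    ("PHONE", re.compile(
        r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
    )),
]


def mask_tweet(text: str) -> str:
    """Replace each matched span with a bracketed category placeholder."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    example = (
        "Email me at jane.doe@example.com or call 410-555-0123, "
        "cc @some_user https://example.org/x"
    )
    print(mask_tweet(example))
```

Pattern matching of this kind only catches information with a regular surface form. Names, locations, and other free-text identifiers would need a different approach, such as a named-entity recognizer, which this sketch does not attempt.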

Code & Models

Repositories

https://bitbucket.org/mdredze/nightjar (Official)

Taxonomy

Topics: Misinformation and Its Impacts · Ethics in Clinical Research · Privacy, Security, and Data Protection