Hijacking Text Heritage: Hiding the Human Signature through Homoglyphic Substitution
Robert Dilworth

TL;DR
This paper investigates how homoglyph substitution can be used as an adversarial technique to protect personal information in text from stylometric analysis.
Contribution
It introduces homoglyph substitution as a novel method for hindering stylometric identification, and the privacy breaches it enables, in textual data.
Findings
Homoglyph substitution degrades stylometric system accuracy.
The method effectively obscures author identity and demographic information.
It offers a potential privacy-preserving technique for textual data sharing.
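To make the idea concrete, here is a minimal sketch of homoglyph substitution. It is not the paper's implementation: the summary above does not give the mapping or evaluation setup, so the `HOMOGLYPHS` table, the function names, and the sample string are all illustrative assumptions. The sketch swaps a few Latin letters for visually similar Cyrillic code points and shows that a simple character-level feature (the count of ASCII letters) changes even though the text looks much the same to a reader.

```python
# Assumed, illustrative mapping: Latin letters -> look-alike Cyrillic letters.
# The paper's actual substitution table is not given in this summary.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    "x": "\u0445",  # CYRILLIC SMALL LETTER HA
}


def substitute_homoglyphs(text, mapping=HOMOGLYPHS):
    """Replace each mapped character with its look-alike; keep the rest."""
    return "".join(mapping.get(ch, ch) for ch in text)


def ascii_letter_count(text):
    """A toy character-level feature: how many ASCII letters the text has."""
    return sum(1 for ch in text if ch.isascii() and ch.isalpha())


original = "a nondescript post"
disguised = substitute_homoglyphs(original)

print(disguised)                      # renders almost identically to the original
print(original == disguised)          # the underlying code points differ
print(ascii_letter_count(original), ascii_letter_count(disguised))
```

A stylometric pipeline that builds features from exact characters or tokens would treat the disguised string as different input, which is the intuition behind the degradation reported above. A pipeline that first normalizes confusable characters could undo this particular sketch.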
Abstract
In what way could a data breach involving government-issued IDs such as passports, driver's licenses, etc., rival a random voluntary disclosure on a nondescript social-media platform? At first glance, the former appears more significant, and that is a valid assessment. The disclosed data could contain an individual's date of birth and address; for all intents and purposes, a leak of that data would be disastrous. Given the threat, the latter scenario involving an innocuous online post seems comparatively harmless--or does it? From that post and others like it, a forensic linguist could stylometrically uncover equivalent pieces of information, estimating an age range for the author (adolescent or adult) and narrowing down their geographical location (specific country). While not an exact science--the determinations are statistical--stylometry can reveal comparable, though noticeably…
