Identification and Anonymization of Named Entities in Unstructured Information Sources for Use in Social Engineering Detection
Carlos Jimeno Miguel, Raul Orduna, Francesco Zola

TL;DR
This paper presents a system for collecting and anonymizing Telegram data (text, audio, and images), using speech-to-text and NER models to support compliance with the GDPR and related legal standards.
Contribution
It introduces an integrated approach combining signal enhancement, transformer-based NER, and anonymization metrics for privacy-preserving cybercrime data collection.
Findings
Parakeet achieves the best audio transcription performance.
The proposed NER solutions attain the highest F1-scores in detecting sensitive information.
Anonymization metrics effectively balance data utility and privacy.
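To make the F1-score finding concrete, here is a minimal sketch of entity-level precision, recall, and F1. It assumes exact matching on (start, end, label) triples; the paper's actual matching scheme is not stated in this summary, and the function name and sample spans are hypothetical.

```python
def entity_f1(gold, predicted):
    """Entity-level precision, recall, and F1 with exact span-and-label matching.

    Assumption: an entity counts as correct only if its start offset, end
    offset, and label all match a gold annotation.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: the second prediction has the right span but the wrong label.
gold = [(0, 5, "PERSON"), (10, 16, "LOCATION"), (20, 29, "PHONE")]
pred = [(0, 5, "PERSON"), (10, 16, "PERSON"), (20, 29, "PHONE")]
precision, recall, f1 = entity_f1(gold, pred)
```

Under exact matching, a mislabeled span counts as both a false positive and a false negative, which is why a single label error lowers precision and recall together.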
Abstract
This study addresses the challenge of creating datasets for cybercrime analysis while complying with the requirements of regulations such as the General Data Protection Regulation (GDPR) and Organic Law 10/1995 of the Penal Code. To this end, the work proposes a system for collecting information from the Telegram platform, including text, audio, and images; implements speech-to-text transcription models that incorporate signal enhancement techniques; and evaluates different Named Entity Recognition (NER) solutions, including Microsoft Presidio and AI models built on a transformer-based architecture. Experimental results indicate that Parakeet achieves the best performance in audio transcription, while the proposed NER solutions achieve the highest F1-score values in detecting sensitive information. In addition, anonymization metrics are presented that allow evaluation of…
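As an illustration of the step that follows entity detection, the sketch below replaces detected spans with type placeholders. It is a generic masking technique under stated assumptions, not the authors' pipeline or the Presidio API: the function name, the placeholder format, and the sample message are hypothetical, and spans are assumed not to overlap.

```python
def mask_entities(text, spans):
    """Replace each (start, end, label) span with a <LABEL> placeholder.

    Assumption: spans do not overlap and use character offsets into `text`.
    """
    # Work from the rightmost span to the left so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text


# Hypothetical message and detector output.
message = "Call Maria at 555-0199 tomorrow"
spans = [(5, 10, "PERSON"), (14, 22, "PHONE")]
masked = mask_entities(message, spans)
print(masked)
```

Keeping the entity type in the placeholder preserves some analytical utility (a reader still knows a person and a phone number were mentioned) while removing the identifying value itself.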
