Sensitive Information Detection: Recursive Neural Networks for Encoding Context

Jan Neerbek

arXiv:2008.10863 · cs.LG · August 26, 2020 · 1 cites

TL;DR

This paper introduces a novel deep learning approach using recursive neural networks for detecting sensitive information in unstructured text, significantly outperforming previous keyword-based methods.

Contribution

The paper develops a context-aware recursive neural network model for sensitive information detection that relies solely on labeled examples, avoiding rule-based limitations.
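As a rough illustration of what "recursive neural network" means here, the sketch below composes word vectors bottom-up over a binary phrase tree and scores the root vector. It is a generic, untrained toy under stated assumptions, not the paper's architecture: the names `RecursiveNet`, `DIM`, `encode`, and `score`, the vocabulary, and the example tree are all hypothetical, and the weights are random rather than learned from labeled examples.

```python
import math
import random

DIM = 4  # hypothetical embedding size, chosen only for illustration


def make_matrix(rows, cols, rng):
    """Small random weight matrix, stored as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class RecursiveNet:
    """Toy recursive network: a node vector is tanh(W [left; right] + b)."""

    def __init__(self, vocab, seed=0):
        rng = random.Random(seed)
        # One random vector per word (a trained model would learn these).
        self.embed = {w: [rng.uniform(-0.5, 0.5) for _ in range(DIM)] for w in vocab}
        self.W = make_matrix(DIM, 2 * DIM, rng)  # composition weights
        self.b = [0.0] * DIM                     # composition bias
        self.u = [rng.uniform(-0.5, 0.5) for _ in range(DIM)]  # classifier weights

    def encode(self, tree):
        """tree is a word (leaf) or a (left, right) tuple (internal node)."""
        if isinstance(tree, str):
            return self.embed[tree]
        left, right = tree
        children = self.encode(left) + self.encode(right)  # concatenate child vectors
        pre = matvec(self.W, children)
        return [math.tanh(p + b) for p, b in zip(pre, self.b)]

    def score(self, tree):
        """Sigmoid score for 'sensitive' at the root (meaningless until trained)."""
        z = sum(u * h for u, h in zip(self.u, self.encode(tree)))
        return 1.0 / (1.0 + math.exp(-z))


vocab = ["the", "deal", "is", "confidential"]
net = RecursiveNet(vocab)
tree = (("the", "deal"), ("is", "confidential"))  # hypothetical parse
root = net.encode(tree)
p = net.score(tree)
```

Because each internal node is computed from its children, the root vector depends on how the words are grouped and not only on which words occur, which is one way a model of this kind can encode context.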

Findings

01. Deep neural models outperform keyword-based methods

02. Context-based detection improves accuracy on real-world data

03. Approach requires only labeled examples, not rules or seed words
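To make the contrast with keyword-based methods concrete, here is a minimal sketch of a seed-word detector. Everything in it is an illustrative assumption: the seed list `SEED_WORDS`, the function `keyword_flag`, and the three example sentences are hypothetical and do not come from the paper or its data.

```python
SEED_WORDS = {"confidential", "salary"}  # hypothetical seed list


def keyword_flag(sentence):
    """Flag a sentence if any token matches a seed word, ignoring context."""
    tokens = sentence.lower().replace(".", " ").split()
    return any(token in SEED_WORDS for token in tokens)


examples = [
    "This memo is confidential.",
    "Nothing in this memo is confidential.",
    "Keep the merger terms between us.",
]
flags = [keyword_flag(s) for s in examples]
```

The rule fires on the first two sentences because both contain a seed word, even though the second negates it, and it stays silent on the third, which contains no seed word at all. A model trained on labeled examples has at least the opportunity to distinguish such cases, since it is not limited to a fixed word list.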

Abstract

The amount of data for processing and categorization grows at an ever-increasing rate. At the same time, the demand for collaboration and transparency in organizations, government, and businesses drives the release of data from internal repositories to the public or third-party domain. This in turn increases the potential for sharing sensitive information. The leak of sensitive information can be very costly, not only financially for organizations but also for individuals. In this work we address the important problem of sensitive information detection. Specifically, we focus on detection in unstructured text documents. We show that simplistic, brittle rule sets for detecting sensitive information find only a small fraction of the actual sensitive information. Furthermore, we show that previous state-of-the-art approaches have been implicitly tailored to such simplistic scenarios and…

Taxonomy

Topics: Topic Modeling · Spam and Phishing Detection · Text and Document Classification Technologies