# Automated U.S Diplomatic Cables Security Classification: Topic Model Pruning vs. Classification Based on Clusters

**Authors:** Khudran Alzhrani, Ethan M. Rudd, C. Edward Chow, and Terrance E. Boult

arXiv: 1703.02248 · 2017-03-08

## TL;DR

This paper compares two approaches for automatically classifying the security level of sensitive unstructured text data, aiming to improve data leak prevention in government environments.

## Contribution

The paper evaluates and contrasts two text security classification approaches, topic model pruning and classification based on clusters, on real WikiLeaks data.

## Key findings

- Topic model pruning and cluster-based classification show different strengths.
- Both methods can effectively identify sensitive texts in real datasets.
- The study provides insights into the suitability of each approach for security applications.

## Abstract

The U.S. Government has been the target of cyber-attacks from all over the world. Just recently, former President Obama accused the Russian government of leaking emails to WikiLeaks and declared that the U.S. might be forced to respond. While Russia denied involvement, it is clear that the U.S. has to take some defensive measures to protect its data infrastructure. Insider threats have been the cause of other sensitive information leaks too, including the infamous Edward Snowden incident. Most of the recent leaks were in the form of text. Due to the nature of text data, security classifications are assigned manually. In an adversarial environment, insiders can leak texts through e-mail, printers, or any untrusted channels. The optimal defense is to automatically detect the unstructured text security class and enforce the appropriate protection mechanism without degrading services or daily tasks. Unfortunately, existing Data Leak Prevention (DLP) systems are not well suited for detecting unstructured texts. In this paper, we compare two recent approaches in the literature for text security classification, evaluating them on actual sensitive text data from the WikiLeaks dataset.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1703.02248/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/1703.02248/full.md

## References

9 references — full list in the complete paper: https://tomesphere.com/paper/1703.02248/full.md

---
Source: https://tomesphere.com/paper/1703.02248