Automatic Labeling for Entity Extraction in Cyber Security

Robert A. Bridges; Corinne L. Jones; Michael D. Iannacone; Kelly M. Testa; John R. Goodall

arXiv:1308.4941 · cs.IR · June 11, 2014 · 74 cites

TL;DR

The paper presents an automated labeling method for cyber-security entity extraction that leverages domain-specific structured data, enabling rapid training of high-accuracy models with minimal manual annotation.

Contribution

It introduces a novel automatic labeling technique using structured data and provides a publicly available annotated corpus for cyber-security entities.
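
As a rough illustration of labeling text from structured data, the sketch below tags tokens in a free-text description whenever they match a field value from an associated structured record. The record contents, the field names (`vendor`, `product`, `version`), and the `auto_label` helper are all hypothetical; the paper's actual matching rules and label set are not given in this summary.

```python
import re

# Hypothetical structured record, e.g. one row of a vulnerability database
# that accompanies a free-text description. Field names are assumptions.
RECORD = {
    "vendor": "Microsoft",
    "product": "Internet Explorer",
    "version": "8",
}

TOKEN_PATTERN = r"\w+|[^\w\s]"  # words, or single punctuation marks


def auto_label(text, record):
    """Tag tokens of `text` that match a field value in `record`.

    Returns a list of (token, label) pairs; "O" marks unlabeled tokens.
    """
    tokens = re.findall(TOKEN_PATTERN, text)
    lowered = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)
    for field, value in record.items():
        value_tokens = [t.lower() for t in re.findall(TOKEN_PATTERN, value)]
        n = len(value_tokens)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            span_is_free = all(label == "O" for label in labels[i:i + n])
            if lowered[i:i + n] == value_tokens and span_is_free:
                labels[i:i + n] = [field] * n
    return list(zip(tokens, labels))


labeled = auto_label(
    "A flaw in Microsoft Internet Explorer 8 allows remote attackers to run code.",
    RECORD,
)
for token, label in labeled:
    print(f"{token}\t{label}")
```

Exact string matching of this kind will over-label short values such as a bare version number, so a practical labeler would need additional context checks; those are left out of this sketch.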

Findings

01. The trained model achieves near-perfect precision, recall, and accuracy.

02. Training takes under 17 seconds on a portion of the corpus (about 750,000 words).

03. Automatic labeling enables rapid model development with minimal manual annotation.

Abstract

Timely analysis of cyber-security information necessitates automated information extraction from unstructured text. While state-of-the-art extraction methods produce extremely accurate results, they require ample training data, which is generally unavailable for specialized applications, such as detecting security-related entities; moreover, manual annotation of corpora is very costly and often not a viable solution. In response, we develop a very precise method to automatically label text from several data sources by leveraging related, domain-specific, structured data and provide public access to a corpus annotated with cyber-security entities. Next, we implement a Maximum Entropy Model trained with the average perceptron on a portion of our corpus (~750,000 words) and achieve near-perfect precision, recall, and accuracy, with training times under 17 seconds.
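
To make the training step more concrete, here is a minimal sketch of a multiclass averaged perceptron applied to per-token labeling. It is not the authors' implementation: the feature set in `token_features`, the toy sentences, and the choice to classify each token independently are all assumptions made for illustration.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Multiclass perceptron whose final weights are averaged over all steps."""

    def __init__(self, classes):
        self.classes = sorted(classes)
        self.weights = defaultdict(float)  # (feature, label) -> current weight
        self._totals = defaultdict(float)  # weight summed over past steps
        self._stamps = defaultdict(int)    # step at which a weight last changed
        self._step = 0

    def score(self, features, label):
        return sum(self.weights.get((f, label), 0.0) for f in features)

    def predict(self, features):
        return max(self.classes, key=lambda c: self.score(features, c))

    def _bump(self, key, delta):
        # Credit the old value for the steps it was in effect, then change it.
        self._totals[key] += (self._step - self._stamps[key]) * self.weights[key]
        self._stamps[key] = self._step
        self.weights[key] += delta

    def update(self, features, truth):
        """Process one training example; return the label guessed beforehand."""
        self._step += 1
        guess = self.predict(features)
        if guess != truth:
            for f in features:
                self._bump((f, truth), +1.0)
                self._bump((f, guess), -1.0)
        return guess

    def average(self):
        """Replace each weight with its mean value over all update steps."""
        if self._step == 0:
            return
        for key, w in list(self.weights.items()):
            # The current value has been in effect since its stamp, inclusive.
            total = self._totals[key] + (self._step - self._stamps[key] + 1) * w
            self.weights[key] = total / self._step


def token_features(tokens, i):
    """Hypothetical hand-picked features for the token at position i."""
    word = tokens[i]
    return [
        "bias",
        "word=" + word.lower(),
        "is_capitalized=" + str(word[:1].isupper()),
        "is_digit=" + str(word.isdigit()),
        "prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"),
        "next=" + (tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"),
    ]


def train(sentences, epochs=5):
    classes = {label for sent in sentences for _, label in sent}
    model = AveragedPerceptron(classes)
    for _ in range(epochs):
        for sent in sentences:
            tokens = [tok for tok, _ in sent]
            for i, (_, label) in enumerate(sent):
                model.update(token_features(tokens, i), label)
    model.average()
    return model


# Toy, made-up training data in (token, label) form.
TOY_SENTENCES = [
    [("Microsoft", "vendor"), ("Windows", "product"), ("7", "version"),
     ("is", "O"), ("affected", "O")],
    [("A", "O"), ("bug", "O"), ("in", "O"), ("Adobe", "vendor"),
     ("Reader", "product"), ("9", "version")],
]

model = train(TOY_SENTENCES)
tokens = ["Adobe", "Reader", "9", "is", "affected"]
print([model.predict(token_features(tokens, i)) for i in range(len(tokens))])
```

Averaging the weights over all update steps, rather than keeping only the final values, is what distinguishes this from the plain perceptron; the timestamps let the average be computed without touching every weight at every step.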

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management