Learning Domain-Specific Word Embeddings from Sparse Cybersecurity Texts

Arpita Roy; Youngja Park; Shimei Pan

arXiv:1709.07470 · cs.CL · September 25, 2017 · 32 cites

Open Access

TL;DR

This paper introduces a novel approach for learning high-quality domain-specific word embeddings from sparse cybersecurity texts by leveraging domain knowledge and semantic relations.

Contribution

The paper presents a new framework and the Word Annotation Embedding (WAE) algorithm to incorporate domain knowledge into word embeddings for sparse texts.

Findings

01. Effective in learning domain-specific embeddings from sparse cybersecurity texts

02. Improves NLP task performance in cybersecurity applications

03. Validated on malware and CVE corpora

Abstract

Word embedding is a Natural Language Processing (NLP) technique that automatically maps words from a vocabulary to vectors of real numbers in an embedding space. It has been widely used in recent years to boost the performance of a variety of NLP tasks such as Named Entity Recognition, Syntactic Parsing and Sentiment Analysis. Classic word embedding methods such as Word2Vec and GloVe work well when they are given a large text corpus. When the input texts are sparse as in many specialized domains (e.g., cybersecurity), these methods often fail to produce high-quality vectors. In this paper, we describe a novel method to train domain-specific word embeddings from sparse texts. In addition to domain texts, our method also leverages diverse types of domain knowledge such as domain vocabulary and semantic relations. Specifically, we first propose a general framework to encode diverse…
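As a rough illustration of the abstract's opening definition (words mapped to real-valued vectors), the sketch below builds a tiny lookup table and compares words by cosine similarity. It is not the paper's WAE algorithm: the three-dimensional vectors, the vocabulary, and the helper names `cosine_similarity` and `most_similar` are all hypothetical and hand-picked for the example, whereas real embeddings are learned from a corpus.

```python
import math

# Hypothetical, hand-picked 3-dimensional vectors (illustration only).
# Learned embeddings typically have tens to hundreds of dimensions.
embeddings = {
    "malware": [0.9, 0.1, 0.0],
    "trojan": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, table):
    """Return the other word in the table whose vector is closest to `word`."""
    query = table[word]
    candidates = [
        (other, cosine_similarity(query, vec))
        for other, vec in table.items()
        if other != word
    ]
    return max(candidates, key=lambda pair: pair[1])[0]


nearest = most_similar("malware", embeddings)
```

With these toy values, the nearest neighbour of "malware" is "trojan", because their vectors point in nearly the same direction. When training text is sparse, learned vectors may not reflect such relations reliably, which is the problem the abstract sets out to address.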

Taxonomy

Topics: Topic Modeling · Software Engineering Research · Spam and Phishing Detection

Methods: GloVe Embeddings