Learning Domain-Specific Word Embeddings from Sparse Cybersecurity Texts
Arpita Roy, Youngja Park, Shimei Pan

TL;DR
This paper introduces a novel approach for learning high-quality domain-specific word embeddings from sparse cybersecurity texts by leveraging domain knowledge and semantic relations.
Contribution
The paper presents a new framework and the Word Annotation Embedding (WAE) algorithm to incorporate domain knowledge into word embeddings for sparse texts.
Findings
Effective in learning domain-specific embeddings from sparse cybersecurity texts
Improves NLP task performance in cybersecurity applications
Validated on malware and CVE corpora
Abstract
Word embedding is a Natural Language Processing (NLP) technique that automatically maps words from a vocabulary to vectors of real numbers in an embedding space. It has been widely used in recent years to boost the performance of a variety of NLP tasks such as Named Entity Recognition, Syntactic Parsing and Sentiment Analysis. Classic word embedding methods such as Word2Vec and GloVe work well when they are given a large text corpus. When the input texts are sparse as in many specialized domains (e.g., cybersecurity), these methods often fail to produce high-quality vectors. In this paper, we describe a novel method to train domain-specific word embeddings from sparse texts. In addition to domain texts, our method also leverages diverse types of domain knowledge such as domain vocabulary and semantic relations. Specifically, we first propose a general framework to encode diverse…
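The sparsity problem described in the abstract can be illustrated with a toy sketch. This is not the paper's WAE algorithm, whose details are not given in the text above. It swaps in plain co-occurrence counts for a trained embedding model, and the corpus, the `<malware>` tag, and the `annotate` helper are all hypothetical names chosen for illustration. The sketch only shows why attaching domain-knowledge annotations to words might help when two related terms never share a context in a small corpus.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a sparse count vector of its neighbours within `window`."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def annotate(sentences, annotations):
    """Insert each known word's annotation tags right after it (hypothetical scheme)."""
    out = []
    for tokens in sentences:
        new_tokens = []
        for token in tokens:
            new_tokens.append(token)
            new_tokens.extend(annotations.get(token, []))
        out.append(new_tokens)
    return out


# Tiny, made-up corpus: the two malware names never share a context word.
corpus = [
    ["zeus", "steals", "banking", "credentials"],
    ["emotet", "spreads", "through", "email"],
]
# Assumed domain knowledge: both terms belong to a "malware" vocabulary.
annotations = {"zeus": ["<malware>"], "emotet": ["<malware>"]}

plain = cooccurrence_vectors(corpus)
augmented = cooccurrence_vectors(annotate(corpus, annotations))

sim_plain = cosine(plain["zeus"], plain["emotet"])
sim_augmented = cosine(augmented["zeus"], augmented["emotet"])
print(sim_plain, sim_augmented)
```

In this toy setup the two terms have no overlapping context in the raw text, so their similarity is zero, while the shared annotation gives them a common context dimension. A trained model such as Word2Vec or GloVe would behave differently in detail; the sketch only conveys the intuition.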
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Spam and Phishing Detection
Methods: GloVe Embeddings
