Detecting Cybersecurity Events from Noisy Short Text
Semih Yagcioglu, Mehmet Saygin Seyfioglu, Begum Citamak, Batuhan Bardak, Seren Guldamlasioglu, Azmi Yuksel, Emin Islam Tatli

TL;DR
This paper introduces a neural network-based approach combining domain-specific embeddings and task features to detect cybersecurity events from noisy social media texts, specifically tweets, outperforming traditional methods.
Contribution
It presents a novel CNN-LSTM model utilizing meta-embeddings and contextual features for cybersecurity event detection in noisy short texts, along with a new annotated Twitter dataset.
Findings
Proposed model outperforms traditional baselines
Effective detection of cybersecurity events from noisy tweets
New annotated dataset of cybersecurity-related tweets
Abstract
It is very critical to analyze messages shared over social networks for cyber threat intelligence and cyber-crime prevention. In this study, we propose a method that leverages both domain-specific word embeddings and task-specific features to detect cyber security events from tweets. Our model employs a convolutional neural network (CNN) and a long short-term memory (LSTM) recurrent neural network which takes word level meta-embeddings as inputs and incorporates contextual embeddings to classify noisy short text. We collected a new dataset of cyber security related tweets from Twitter and manually annotated a subset of 2K of them. We experimented with this dataset and concluded that the proposed model outperforms both traditional and neural baselines. The results suggest that our method works well for detecting cyber security events from noisy short text.
| Models | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| SVM+BoW | 0.75 | 0.71 | 0.70 | 0.70 |
| SVM+Meta-Encoder | 0.71 | 0.64 | 0.61 | 0.63 |
| CNN-static (Yoon Kim, 2014) | 0.76 | 0.72 | 0.69 | 0.70 |
| Human | 0.65 | 0.70 | 0.87 | 0.59 |
| CNN+Meta-Encoder | 0.78 | 0.78 | 0.63 | 0.70 |
| LSTM+Meta-Encoder | 0.78 | 0.74 | 0.70 | 0.72 |
| Ours (see Fig. 2) | 0.82 | 0.79 | 0.72 | 0.76 |
| Tweet | Our Model | GT |
|---|---|---|
|  | 0 | 0 |
|  | 1 | 1 |
|  | 1 | 0 |
|  | 0 | 1 |
|  | 0 | 1 |
|  | 0 | 1 |
|  | 1 | 0 |
| Subjects | Accuracy | Precision | Recall | F1 | Cohen’s κ |
|---|---|---|---|---|---|
| #1 | 0.62 | 0.54 | 1 | 0.7 | 0.43 |
| #2 | 0.54 | 0.5 | 0.95 | 0.65 | 0.33 |
| #3 | 0.66 | 0.58 | 0.91 | 0.71 | 0.42 |
| #4 | 0.66 | 0.57 | 1 | 0.73 | 0.46 |
| #5 | 0.8 | 0.8 | 0.73 | 0.77 | 0.28 |
| #6 | 0.66 | 0.57 | 0.95 | 0.72 | 0.41 |
| #7 | 0.7 | 0.63 | 0.82 | 0.71 | 0.31 |
| #8 | 0.6 | 0.56 | 0.60 | 0.58 | 0.28 |
| Average | 0.65 | 0.70 | 0.87 | 0.59 | 0.36 |
|  | Hyperparameter | Value |
|---|---|---|
| general | vector_size | 100 |
| LDA | num_topics | 40 |
|  | update_every | 1 |
|  | chunksize | 10000 |
|  | passes | 1 |
| w2v & fastText | window_size | 5 |
|  | min_count | 5 |
|  | iter | 5 |
|  | alpha | 0.025 |
| GloVe | window_size | 5 |
|  | no_components | 100 |
|  | learning_rate | 0.01 |
|  | epoch_num | 10 |
| Autoencoder | nb_epoch | 100 |
|  | batch_size | 100 |
|  | shuffle | True |
|  | validation_split | 0.1 |
| CRF | learning_rate | 0.01 |
|  | l2 regularization | 1e-2 |
| Features | Accuracy |
|---|---|
| All | 0.725 |
| NER & LDA | 0.705 |
| LDA & IE | 0.69 |
| NER & IE | 0.71 |
| IE | 0.68 |
| NER | 0.64 |
| LDA | 0.66 |
Detecting Cybersecurity Events from Noisy Short Text
Semih Yagcioglu, Mehmet Saygin Seyfioglu, Begum Citamak, Batuhan Bardak, Seren Guldamlasioglu, Azmi Yuksel, Emin Islam Tatli
STM A.Ş., Ankara, Turkey
{syagcioglu, msaygin.seyfioglu, begum.citamak, batuhan.bardak, sguldamlasioglu, azyuksel, emin.tatli}@stm.com.tr
1 Introduction
Twitter has become a medium where people can share and receive timely messages about almost anything. People share facts and opinions, broadcast news, and communicate with each other through these messages. Owing to the low barrier to tweeting and the growth in mobile device usage, tweets can provide valuable information, as people often share instantaneous updates such as breaking news before it is even broadcast on the newswire (cf. Petrović et al. (2010)). People also share cyber security events in their tweets, such as zero-day exploits, ransomware, data leaks, security breaches, and vulnerabilities. Automatically detecting such events has various practical applications, such as taking the necessary precautions promptly and creating awareness, as illustrated in Fig. 1. Recently, working with cyber security related text has garnered a lot of interest in both the computer security and natural language processing (NLP) communities (cf. Joshi et al. (2013); Ritter et al. (2015); Roy et al. (2017)).
Nevertheless, detecting cyber security events from tweets poses a great challenge, as tweets are noisy and, because of length limits, often lack sufficient context to discriminate cyber security events. Recently, deep learning methods have been shown to outperform traditional approaches in several NLP tasks Chen and Manning (2014); Bahdanau et al. (2014); Kim (2014); Hermann et al. (2015). Inspired by this progress, our goal is to detect cyber security events in tweets by learning domain-specific word embeddings and task-specific features using neural architectures. The key contribution of this work is twofold. First, we propose an end-to-end learning system to effectively detect cyber security events from tweets. Second, we propose a noisy short text dataset with annotated cyber security events for unsupervised and supervised learning tasks. To the best of our knowledge, this is the first study that incorporates domain-specific meta-embeddings and contextual embeddings for detecting cyber security events.
2 Method
In the subsequent sections, we describe how we address the challenges of our task. An overview of the proposed system is illustrated in Fig. 2.
2.1 Meta-Embeddings
Word embedding methods might capture different semantic and syntactic features about the same word. To exploit this variety without losing the semantics, we learn meta-embeddings for words.
Word Embeddings. We train Word2vec Mikolov et al. (2013), GloVe Pennington et al. (2014), and fastText Joulin et al. (2016); Bojanowski et al. (2016) to learn domain-specific word embeddings on the unlabeled tweet corpus.
Meta-Encoder. Inspired by Yin and Schütze (2015), we learn meta-embeddings for words from the aforementioned word embeddings. We use a Convolutional Autoencoder Masci et al. (2011) to encode the stacked embeddings into a latent variable and to reconstruct the original embeddings from this latent variable. Both the encoder and the decoder consist of convolutional layers. The encoder part is shown in Fig. 3.
We argue that this network learns a much simpler mapping while capturing the semantic and syntactic relations from each of these embeddings, thus leading to a richer word-level representation. Another advantage of learning meta-embeddings for words is that the proposed architecture alleviates the Out-of-Vocabulary (OOV) embeddings problem, as we still get embeddings from the fastText channel, in contrast to GloVe and word2vec, where no embeddings are available for OOV words.
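The OOV argument above can be illustrated with a minimal sketch of how the three channels might be stacked for a single word before encoding. The helper names `stack_channels` and `toy_fasttext`, the toy lookup tables, and the use of zero vectors for words missing from word2vec and GloVe are assumptions made for illustration; they are not taken from the paper, and `toy_fasttext` is only a stand-in for a real subword model.

```python
from typing import Callable, Dict, List

Vector = List[float]


def toy_fasttext(word: str, dim: int = 4) -> Vector:
    """Hypothetical stand-in for a subword model: buckets character trigrams."""
    padded = f"<{word}>"
    grams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    vec = [0.0] * dim
    if not grams:
        return vec
    for gram in grams:
        vec[sum(map(ord, gram)) % dim] += 1.0
    return [value / len(grams) for value in vec]


def stack_channels(word: str,
                   w2v: Dict[str, Vector],
                   glove: Dict[str, Vector],
                   fasttext: Callable[[str], Vector],
                   dim: int) -> List[Vector]:
    """Return a 3 x dim stack in the order word2vec, fastText, GloVe."""
    zeros = [0.0] * dim  # assumed fallback for words unknown to a lookup table
    return [
        list(w2v.get(word, zeros)),
        list(fasttext(word)),
        list(glove.get(word, zeros)),
    ]


w2v_table = {"malware": [1.0, 0.0, 0.0, 0.0]}
glove_table = {"malware": [0.0, 1.0, 0.0, 0.0]}

known = stack_channels("malware", w2v_table, glove_table, toy_fasttext, 4)
unknown = stack_channels("miraibotnet", w2v_table, glove_table, toy_fasttext, 4)
```

For the unseen word, only the subword channel carries a signal, which is the property the paragraph above relies on.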
2.2 Contextual Embeddings
To capture the contextual information, we learn task-specific features from tweets.
LDA. Latent Dirichlet Allocation (LDA) is a generative probabilistic model to discover topics from a collection of documents Blei et al. (2003). LDA works in an unsupervised manner and learns a finite set of categories from a collection, thus representing documents as mixtures of topics. We train an LDA model to summarize each tweet by using the topic with the maximum likelihood, e.g. the topic “vulnerability” for the tweet in Fig. 1.
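Summarizing a tweet by its most likely topic amounts to an argmax over the per-tweet topic distribution. The sketch below assumes the distribution is available as a list of (topic, probability) pairs; the function name `dominant_topic`, the human-readable topic labels, and the probabilities are hypothetical.

```python
from typing import List, Tuple


def dominant_topic(topic_distribution: List[Tuple[str, float]]) -> str:
    """Return the label of the topic with the highest probability."""
    if not topic_distribution:
        raise ValueError("empty topic distribution")
    label, _ = max(topic_distribution, key=lambda pair: pair[1])
    return label


# Illustrative values only; a trained LDA model would supply these.
tweet_topics = [("vulnerability", 0.62), ("phishing", 0.21), ("botnet", 0.17)]
summary = dominant_topic(tweet_topics)
```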
NER. Named Entity Recognition (NER) tags the specified named entities from raw text into pre-defined categories. Named entities can be general categories such as people and organizations, or domain-specific entities learned from a dataset containing the corresponding entity tags. We employ an automatically annotated dataset that contains entities from the cyber security domain Bridges et al. (2013) to train our Conditional Random Field model using handcrafted features, i.e., uni-grams, bi-grams, and gazetteers. The dataset comprises 850K tokens that contain named entities such as ‘Relevant Term’, ‘Operating System’, ‘Hardware’, ‘Software’, and ‘Vendor’, in the standard IOB-tagging format. Our NER model tags “password” as ‘Relevant Term’ and “Apple” as ‘Vendor’ for the tweet in Fig. 1.
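To make the IOB format concrete, the following sketch groups per-token tags back into entity spans. The function name `iob_to_spans`, the tag spellings (for example `B-Relevant_Term`), and the sample tokens are assumptions for illustration and do not come from the dataset itself.

```python
from typing import List, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity text, entity type) pairs."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that was open.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


tokens = ["apple", "fixes", "password", "bug", "in", "mac", "os"]
tags = ["B-Vendor", "O", "B-Relevant_Term", "O", "O",
        "B-Operating_System", "I-Operating_System"]
entities = iob_to_spans(tokens, tags)
```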
IE. Uncovering entities and the relations between those entities is an important task for detecting cyber security events. To address this, we use Information Extraction (IE), in particular the OpenIE annotator Angeli et al. (2015) from Stanford CoreNLP Manning et al. (2014). Subsequently, we extract relations between noun phrases as dependency triplets, each consisting of two arguments and an implicit semantic relation between those arguments. Such a triplet is extracted from the tweet in Fig. 1.
Contextual-Encoder. We use the outputs of the LDA, NER and IE algorithms to obtain a combined vector representation using the meta-embeddings described in Sec. 2.1. Thus, contextual embeddings are calculated as follows (we used zero vectors for the non-existent relations).
[TABLE]
where function extracts contextual embeddings and denotes a tweet, , , and represent meta-embedding, LDA, NER, and IE functions, respectively. Lastly, and denote the output tokens.
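One plausible reading of this step, offered only as a sketch, is to look up the meta-embeddings of the tokens produced by each of the three extractors, pool them, and concatenate the results, falling back to a zero vector when an extractor returns nothing. The mean pooling, the concatenation order, and the names `mean_vector` and `contextual_embedding` are assumptions and should not be taken as the paper's formula.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector], dim: int) -> Vector:
    """Average a list of vectors; return a zero vector when the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def contextual_embedding(lda_tokens: Sequence[str],
                         ner_tokens: Sequence[str],
                         ie_tokens: Sequence[str],
                         meta: Dict[str, Vector],
                         dim: int) -> Vector:
    """Concatenate pooled meta-embeddings of the LDA, NER and IE outputs."""
    combined: Vector = []
    for tokens in (lda_tokens, ner_tokens, ie_tokens):
        vectors = [meta[token] for token in tokens if token in meta]
        combined.extend(mean_vector(vectors, dim))
    return combined


# Toy 2-dimensional meta-embeddings, for illustration only.
meta_table = {
    "vulnerability": [1.0, 0.0],
    "apple": [0.0, 1.0],
    "password": [1.0, 1.0],
}
context = contextual_embedding(["vulnerability"], ["apple", "password"], [],
                               meta_table, 2)
```

In this toy call no relation is extracted, so the last two entries of `context` are zeros.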
2.3 Event Detection
Inspired by the visual question answering task Antol et al. (2015), where different modalities are combined by CNNs and RNNs, we adopt a similar network architecture for our task. Prior to training and inference, we preprocess, normalize and tokenize each tweet as described in Sec. 3.
CNN. We employ a CNN model similar to that of Kim (2014), where we feed the network with static meta-embeddings. Our network consists of one convolutional layer with varying filter sizes. All tweets are zero-padded to the maximum tweet length. We apply an activation function and global max pooling at the end of the CNN.
RNN. We use a bi-directional LSTM Hochreiter and Schmidhuber (1997), which reads the input in both directions and concatenates the forward and backward hidden states to encode the input as a sequence. Our LSTM model consists of a single layer.
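The CNN branch can be pictured with a small pure-Python sketch of zero padding, one convolution filter, and global max pooling. The ReLU activation, the toy filter weights, and the names `zero_pad` and `conv_global_max` are assumptions for illustration; a real implementation would use a deep learning framework and learned filters.

```python
from typing import List

Matrix = List[List[float]]


def zero_pad(tweet: Matrix, max_len: int, dim: int) -> Matrix:
    """Pad (or truncate) a token-embedding sequence to exactly max_len rows."""
    padding = [[0.0] * dim for _ in range(max_len - len(tweet))]
    return tweet[:max_len] + padding


def conv_global_max(tweet: Matrix, kernel: Matrix) -> float:
    """Slide one filter over the sequence, apply ReLU, keep the largest response."""
    width = len(kernel)
    responses = []
    for start in range(len(tweet) - width + 1):
        window = tweet[start:start + width]
        total = sum(x * w
                    for row, kernel_row in zip(window, kernel)
                    for x, w in zip(row, kernel_row))
        responses.append(max(0.0, total))  # ReLU is an assumed activation
    return max(responses, default=0.0)


# Three tokens with 1-dimensional toy embeddings, padded to length 5.
padded = zero_pad([[1.0], [2.0], [3.0]], max_len=5, dim=1)
# Filters of width 2 and 3 stand in for "varying filter sizes".
kernels = [[[1.0], [1.0]], [[1.0], [1.0], [1.0]]]
features = [conv_global_max(padded, kernel) for kernel in kernels]
```

Each filter contributes one pooled value, so the feature vector length equals the number of filters regardless of tweet length.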
3 Experiments
Data Collection. We collected tweets using Twitter’s streaming API over a period from 2015-01-01 to 2017-12-31, using an initial set of keywords, henceforth referred to as seed keywords, to retrieve cyber security related tweets. In particular, we use the main group names of the cyber security taxonomy described in Le Sceller et al. (2017) as seed keywords, e.g. ‘denial of service’, ‘botnet’, ‘malware’, ‘vulnerability’, ‘phishing’, ‘data breach’, to retrieve relevant tweets. Using seed keywords is a practical way to filter out noise, considering the sparsity of cyber security related tweets in the whole tweet stream. After the initial retrieval, we use langid.py Lui and Baldwin (2012) to filter out non-English tweets.
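As an offline analogue of this keyword-based retrieval, the sketch below keeps only texts that mention at least one seed keyword. The keyword tuple repeats the examples quoted above rather than the full taxonomy, and the function name `matches_seed` and the sample tweets are hypothetical.

```python
from typing import Iterable, List, Sequence

# Only the example seed keywords quoted in the text, not the full taxonomy.
SEED_KEYWORDS = (
    "denial of service", "botnet", "malware",
    "vulnerability", "phishing", "data breach",
)


def matches_seed(text: str, keywords: Sequence[str] = SEED_KEYWORDS) -> bool:
    """Return True if the text contains any seed keyword (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def filter_stream(tweets: Iterable[str]) -> List[str]:
    """Keep only tweets that mention at least one seed keyword."""
    return [tweet for tweet in tweets if matches_seed(tweet)]


sample = [
    "New Botnet targets home routers",
    "Lovely weather in Ankara today",
    "Phishing campaign spoofs bank login pages",
]
kept = filter_stream(sample)
```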
Data Preprocessing. We substitute user handles with $mention$ and hyperlinks with $url$, and we remove emoticons, the RT keyword, and the # character. We limit characters that repeat more than two times, remove capitalization and tokenize tweets using the Twitter tokenizer in the nltk library. Tweets often contain non-standard forms, e.g. writing *cu tmrrw* instead of *see you tomorrow*. Although there are several reasons for this, the most prominent one is that people tend to mimic prosodic effects in speech Eisenstein (2013). To overcome this, we use lexical normalization, where we substitute OOV tokens with in-vocabulary (IV) standard forms, i.e. a standard form available in a dictionary. In particular, we use the UniMelb Han et al. (2012) and UTDallas Liu et al. (2011) datasets. Lastly, we remove identical tweets and check validity by removing tweets with fewer than 3 non-special tokens.
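A rough approximation of these cleaning steps can be written with regular expressions. In the sketch below, the placeholder strings, the regular expressions, the order of the steps, and the whitespace split (standing in for the nltk Twitter tokenizer) are all assumptions; emoticon removal and lexical normalization are not covered.

```python
import re
from typing import List

URL = re.compile(r"https?://\S+")
HANDLE = re.compile(r"@\w+")
RETWEET = re.compile(r"\brt\b")
REPEAT = re.compile(r"(.)\1{2,}")  # three or more copies of one character


def preprocess(tweet: str) -> List[str]:
    """Lowercase, replace links and handles, drop RT and '#', squeeze repeats."""
    text = tweet.lower()
    text = URL.sub(" $url$ ", text)          # placeholder string is illustrative
    text = HANDLE.sub(" $mention$ ", text)   # placeholder string is illustrative
    text = RETWEET.sub(" ", text)
    text = text.replace("#", "")
    text = REPEAT.sub(r"\1\1", text)         # keep at most two repeats
    return text.split()                      # stand-in for a tweet tokenizer


tokens = preprocess("RT @alice: Sooooo many #malware alerts https://t.co/abc")
```

Dropping only the `#` character keeps the hashtag text as an ordinary token, which is why compound hashtags stay fused after this step.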
Data Annotation. We instructed cyber security domain experts to label the dataset manually. Annotators are asked to provide a binary label indicating whether or not there is a cyber security event in the given tweet. Annotators are told to skip tweets if they are unsure about their decisions. Finally, we validated the annotations by accepting an annotation only if a minimum number of the annotators agreed on it. Therefore, we presume that the quality of the attained ground-truth labels is dependable. Overall, 2K tweets are annotated.
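The agreement rule can be sketched as a simple vote count in which skipped tweets are represented as `None`. The function name `accept_label` and the threshold `min_agree` are hypothetical, and the values used in the example are not the ones used in the study.

```python
from collections import Counter
from typing import List, Optional


def accept_label(votes: List[Optional[int]], min_agree: int) -> Optional[int]:
    """Return the majority label if enough annotators agree, else None."""
    cast = [vote for vote in votes if vote is not None]  # drop skipped tweets
    if not cast:
        return None
    label, count = Counter(cast).most_common(1)[0]
    return label if count >= min_agree else None


accepted = accept_label([1, 1, 0], min_agree=2)
rejected = accept_label([1, None, 0], min_agree=2)
```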
Dataset Statistics. After preprocessing, our initial tweet dataset is reduced in size, and 2K of the remaining tweets are labeled (available at https://stm-ai.github.io/). The labeled dataset is somewhat balanced between event-related and non-event tweets, and it is divided into training and testing sets.
Training. We used Keras with a TensorFlow backend for our neural models. For fastText and word2vec embeddings we used Gensim, and for GloVe we used the glove-python library. For training the word embeddings, we use the entire tweet text corpus and obtain 100-dimensional word embeddings. We set the alpha parameter of the word2vec and fastText models to 0.025 and the window size to 5. For the GloVe embedding model, we set the learning rate to 0.01, together with the alpha and maximum count parameters. For all embedding models, we set the minimum count parameter to 5, which eliminates infrequent words. Consequently, we obtain a three-channel word embedding tensor in which the first, second and third channels consist of word2vec, fastText and GloVe embeddings, respectively. We then encode these stacked embeddings into meta-embeddings by using our Meta-Encoder. We train our two-channel architecture, which combines both LSTM and CNN, with two inputs: meta-embeddings and contextual embeddings. We use meta-embeddings for feature learning via the LSTM and CNN, and their feature maps are concatenated with contextual embeddings in the Fusion Layer. In the end, fully connected layers and a softmax classifier are added, and the whole network is trained to minimize binary cross entropy loss with a learning rate of 0.01 by using the Adam optimizer Kingma and Ba (2014) (see the supplementary notes for hyperparameter choices).
Baselines. To compare with our results, we implemented the following baselines: SVM with BoW: We trained an SVM classifier using Bag-of-words (BoW) which provides a simplified representation of textual data by calculating the occurrence of words in a document. SVM with meta-embeddings: We trained an SVM classifier with the aforementioned meta-embeddings. CNN-Static: We used Kim (2014)’s approach using word2vec embeddings.
Results. Table 1 summarizes the overall performance of each method. To compare the models, we used four different metrics: accuracy, recall, precision and F1-score. Each reported result is the mean of a 5-fold cross validation experiment. It is clear that our method outperforms various simple and neural baselines. Also, in Table 2, we provide results of our proposed model along with the ground-truth annotations. We also provide results with the different combinations of contextual features, i.e., LDA, NER, IE (see the supplementary notes for feature combination details).
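For readers who want to check how the four metrics relate, the sketch below computes them from binary predictions. As input it uses the seven prediction and ground-truth pairs listed in Table 2; these are a handful of illustrative examples rather than the test set, so the resulting values are not comparable to Table 1. The function name `binary_metrics` is hypothetical.

```python
from typing import Dict, List


def binary_metrics(predicted: List[int], actual: List[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# The seven (model, ground truth) pairs from Table 2.
model_output = [0, 1, 1, 0, 0, 0, 1]
ground_truth = [0, 1, 0, 1, 1, 1, 0]
scores = binary_metrics(model_output, ground_truth)
```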
Human Study. Eight different subjects are thoroughly instructed about what is considered a cyber security event and individually asked to label randomly selected tweets from the test set. The results are provided in Table 3.
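Table 3 also reports a Cohen's agreement score per subject. Assuming this is Cohen's kappa computed between a subject's labels and the ground truth, it can be obtained as in the sketch below; the function name `cohens_kappa` and the toy labels are illustrative and are not data from the study.

```python
from typing import List


def cohens_kappa(first: List[int], second: List[int]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(first)
    labels = set(first) | set(second)
    observed = sum(1 for a, b in zip(first, second) if a == b) / n
    expected = sum((first.count(label) / n) * (second.count(label) / n)
                   for label in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


subject_labels = [1, 1, 0, 0]   # toy labels from one subject
reference_labels = [1, 0, 0, 0]  # toy ground truth
kappa = cohens_kappa(subject_labels, reference_labels)
```

Here the observed agreement is 0.75 and the chance agreement is 0.5, which gives a kappa of 0.5.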
Error Analysis. In order to understand how our system performs, we randomly select a set of erroneously classified instances from the test dataset. Type I Errors. Our model identifies the tweet “uk warned following breach in air pollution regulation ” as an event, whereas it is clearly about a breach of a regulation. We hypothesize that this is due to the lack of sufficient training data. The following tweet is also identified as an event: “wannacry ransomware ransomwareattack ransomwarewannacry malware ”. We suspect that the weights of multiple relevant terms deceive the model.
Type II Errors. Our model fails to identify the following positive sample as an event. For “playstation network was the target of miraibotnet ddos attack guiding tech rss news feed search”, our model fails to recognize ‘miraibotnet’ in the tweet. We suspect this is due to the lack of hashtag decomposition; otherwise, the model could recognize ‘mirai’ and ‘botnet’ as separate words.
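Hashtag decomposition is not part of the proposed system, but the idea can be sketched as dictionary-based word segmentation. The example below uses dynamic programming to find the split with the fewest vocabulary words; the function name `segment` and the tiny vocabulary are assumptions for illustration.

```python
from typing import List, Optional, Set


def segment(hashtag: str, vocab: Set[str]) -> Optional[List[str]]:
    """Split a string into the fewest vocabulary words, or return None."""
    n = len(hashtag)
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []  # the empty prefix needs no words
    for end in range(1, n + 1):
        for start in range(end):
            piece = hashtag[start:end]
            if best[start] is None or piece not in vocab:
                continue
            candidate = best[start] + [piece]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[n]


toy_vocab = {"mirai", "botnet", "bot", "net", "ransomware", "attack"}
parts = segment("miraibotnet", toy_vocab)
```

With such a step in place, ‘mirai’ and ‘botnet’ would reach the embedding lookup as separate tokens.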
Discussions. Cyber security related tweets are complicated and analysing them requires in-depth domain knowledge. Although human subjects are properly instructed, the results of the human study indicate that our task is challenging and humans can hardly discriminate cyber security events amongst cyber security related tweets. To further investigate this, we plan to increase the number of human subjects. One limitation of this study is that we do not consider hyperlinks and user handles which might provide additional information. One particular problem we have not addressed in this work is hashtag decomposition. Error analysis indicates that our model might get confused by challenging examples due to ambiguities and lack of context.
4 Related Work
Event detection on Twitter is studied extensively in the literature Petrović et al. (2010); Sakaki et al. (2010); Weng and Lee (2011); Ritter et al. (2012); Yuan et al. (2013); Atefeh and Khreich (2015). Banko et al. (2007) proposed a method to extract relational tuples from a web corpus without requiring hand-labeled data. Ritter et al. (2012) proposed a method for categorizing events in Twitter. Luo et al. (2015) suggested an approach to infer binary relations produced by open IE systems. Recently, Ritter et al. (2015) introduced the first study to extract event mentions from a raw Twitter stream for the event categories of DDoS attacks, data breaches, and account hijacking. Chang et al. (2016) proposed an LSTM-based approach which learns tweet-level features automatically to extract events from tweet mentions. Lately, Le Sceller et al. (2017) proposed a model to detect cyber security events in Twitter which uses a taxonomy and a set of seed keywords to retrieve relevant tweets. Tonon et al. (2017) proposed a method to detect events from Twitter by using semantic analysis. Roy et al. (2017) proposed a method to learn domain-specific word embeddings for sparse cyber security text. Prior art in this direction Ritter et al. (2015); Chang et al. (2016) focuses on extracting events and, in particular, on predicting the events’ posterior given the presence of particular words. Le Sceller et al. (2017); Tonon et al. (2017) focus on detecting cyber security events from Twitter. Our work differs from prior studies in that we formulate the cyber security event detection problem as a classification task and learn meta-embeddings from domain-specific word embeddings while incorporating task-specific features and employing neural architectures.
5 Conclusion
We introduced a novel neural model that utilizes meta-embeddings learned from domain-specific word embeddings and task-specific features to capture contextual information. We present a unique dataset of cyber security related noisy short text collected from Twitter. The experimental results indicate that the proposed model outperforms the traditional and neural baselines. A possible future research direction is detecting cyber security related events in different languages.
Acknowledgments
We would like to thank Merve Nur Yılmaz and Benan Bardak for their invaluable help with the annotation process on this project. This research is fully supported by STM A.Ş. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the sponsor.
Supplementary Notes
In this supplement, we provide the implementation details that we thought might help to reproduce the results reported in the paper.
What about the model hyperparameters?
In Table 4, we provide the hyperparameters we used to report the results in the paper.
Can we download the data?
Yes. Along with this submission, we provide the whole dataset we collected. Nevertheless, due to the restriction imposed by Twitter, the dataset only contains unique tweet IDs. However, the associated tweets can be easily downloaded with the provided tweet IDs. The dataset is available at https://stm-ai.github.io/
How to reproduce the results?
Here we describe the key steps to recollect data, retrain model and reproduce results on the test set.
- Step 1: As mentioned before, researchers can recollect the data through the provided tweet IDs.
- Step 2: After recollecting the data, preprocessing, normalization and tokenization are applied as detailed in Section 3 (Experiments).
- Step 3: In order to learn domain-specific word embeddings on the unlabeled tweet corpus, the meta-embedding encoder is trained on word2vec, GloVe and fastText embeddings as discussed in Section 2.
- Step 4: The contextual embedding encoder is implemented in order to capture contextual information as mentioned in Section 2.
- Step 5: The network architecture combining CNNs and RNNs is implemented for detecting cyber security related events as detailed in Section 2.
Have you used a simpler model?
We favor simple models over complex ones, but for our task, detecting cyber security related events requires tedious effort as well as domain knowledge. In order to capture this domain knowledge, we designed handcrafted features with domain experts to address some of the challenges of our problem. Nevertheless, we also learn to extract features using deep neural networks.
In Section 3 of the paper, we also provide ablations in which we discuss how much value each part of the proposed method adds to the overall success.
Why did you use all of the contextual features?
At first glance, it might seem that we threw everything we had at the problem. However, we argue that providing contextual features yields a better initialization, thus helping the network converge to better local minima. We also tried different combinations of the contextual features, i.e., LDA, NER, and IE, by training a two-layer fully connected neural network on them and, although marginally, the combination of all three yields the best results (see Table 5). We argue that NER is more biased towards making false positives, as it does not consider word order or semantic meaning and only raises a flag when many relevant terms are present. However, the results suggest that NER’s features can be beneficial when used in combination with IE and LDA, which indicates that NER detects something unique that IE and LDA do not.
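The seven rows of Table 5 correspond to every non-empty subset of the three contextual features. A minimal sketch of how such an ablation grid could be enumerated is shown below; the function name `feature_subsets` is hypothetical, and the training and evaluation of each subset are left out.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

FEATURES = ("LDA", "NER", "IE")


def feature_subsets(features: Sequence[str] = FEATURES) -> List[Tuple[str, ...]]:
    """List every non-empty feature combination, largest subsets first."""
    return [subset
            for size in range(len(features), 0, -1)
            for subset in combinations(features, size)]


ablation_grid = feature_subsets()
```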
How to recollect data?
As our goal is to develop a system to detect cyber security events, collecting more data is crucial for our task. Using the seed keywords described in Section 3 of the paper, even more data can be collected with Twitter’s streaming API over a desired period.
What are the most common words?
The word cloud in Fig. 4 shows the most common words in the dataset, excluding the seed keywords.
How about annotations?
We expected annotators to discriminate between a cyber security event and a non cyber security event. In that regard, we used a team of annotators, who manually annotated the cyber security related tweets. Each annotator annotated their share of tweets individually, and in sum, the team annotated a total of 2K tweets. Following the same procedure, it is possible to annotate more data, which we believe would help achieve even better results.
How is the human evaluation done?
We randomly selected tweets and provided this subset to human subjects for evaluation. Each annotator evaluated the tweets independently for his/her share of tweets. Then, we compared their annotations against ground-truth annotations.
What about hardware details?
All computations are done on a system with the following specifications: NVIDIA Tesla K GPU with GB of VRAM, GB of RAM and Intel Xeon E processor.
References
- Angeli et al. (2015) Gabor Angeli, Melvin Johnson Premkumar, and Christopher D Manning. 2015. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the ACL 2015.
- Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE ICCV, pages 2425–2433.
- Atefeh and Khreich (2015) Farzindar Atefeh and Wael Khreich. 2015. A survey of techniques for event detection in Twitter. Computational Intelligence, 31(1):132–164.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473. Version 7.
- Banko et al. (2007) Michele Banko, Michael J Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In IJCAI, volume 7, pages 2670–2676.
- Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. JMLR, 3(Jan):993–1022.
- Bojanowski et al. (2016) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv:1607.04606. Version 2.
- Bridges et al. (2013) Robert A Bridges, Corinne L Jones, Michael D Iannacone, Kelly M Testa, and John R Goodall. 2013. Automatic labeling for entity extraction in cyber security. arXiv:1308.4941. Version 3.
