SynthCTI: LLM-Driven Synthetic CTI Generation to enhance MITRE Technique Mapping
Álvaro Ruiz-Ródenas, Jaime Pujante Sáez, Daniel García-Algora, Mario Rodríguez Béjar, Jorge Blasco, José Luis Hernández-Ramos

TL;DR
SynthCTI introduces a data augmentation framework that generates high-quality synthetic threat intelligence sentences to address data scarcity and class imbalance in mapping CTI to MITRE ATT&CK techniques, improving classification performance.
Contribution
The paper presents SynthCTI, a novel clustering-based data augmentation method using LLMs to generate diverse, semantically faithful synthetic CTI data for underrepresented techniques.
Findings
Synthetic data improves macro-F1 scores significantly.
Smaller models with augmentation outperform larger models without it.
SynthCTI enhances CTI classification accuracy across datasets.
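The findings are reported in macro-F1, which averages the per-class F1 scores so that every technique counts equally regardless of how many examples it has. The sketch below illustrates why that metric exposes class imbalance where accuracy hides it. The labels and predictions are invented toy data, not results from the paper, and the helper names (`per_class_f1`, `macro_f1`, `accuracy`) are hypothetical.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined),
        # so F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy, made-up data: 8 sentences for a frequent technique, 2 for a rare one.
y_true = ["T1059"] * 8 + ["T1003"] * 2
# A degenerate classifier that always predicts the frequent technique.
y_pred = ["T1059"] * 10

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro-F1 = {mf1:.3f}")
```

In this toy case the classifier never predicts the rare technique, yet accuracy still looks high while macro-F1 drops sharply, because the rare class contributes an F1 of zero to the average.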
Abstract
Cyber Threat Intelligence (CTI) mining involves extracting structured insights from unstructured threat data, enabling organizations to understand and respond to evolving adversarial behavior. A key task in CTI mining is mapping threat descriptions to MITRE ATT&CK techniques. However, this process is often performed manually, requiring expert knowledge and substantial effort. Automated approaches face two major challenges: the scarcity of high-quality labeled CTI data and class imbalance, where many techniques have very few examples. While domain-specific Large Language Models (LLMs) such as SecureBERT have shown improved performance, most recent work focuses on model architecture rather than addressing the data limitations. In this work, we present SynthCTI, a data augmentation framework designed to generate high-quality synthetic CTI sentences for underrepresented MITRE ATT&CK…
Taxonomy
Topics: Medical Imaging Techniques and Applications · Nuclear Physics and Applications · Fault Detection and Control Systems
