AttackQA: Development and Adoption of a Dataset for Assisting Cybersecurity Operations using Fine-tuned and Open-Source LLMs

Varun Badrinath Krishna

arXiv:2411.01073·cs.LG·November 5, 2024

TL;DR

This paper introduces AttackQA, a large cybersecurity Q&A dataset created with open-source LLMs, and demonstrates how fine-tuning open-source models enhances cybersecurity response accuracy in RAG systems.

Contribution

The paper presents a new cybersecurity Q&A dataset and shows that a RAG system built on fine-tuned open-source LLMs can outperform one built on a proprietary model.

Findings

1. Fine-tuned open-source models outperform GPT-4o in accuracy.

2. The dataset contains 25,335 Q&A pairs with rationales.

3. The open-source RAG system achieves both high speed and high accuracy.

Abstract

Retrieval-augmented generation (RAG) on specialized domain datasets has shown improved performance when large language models (LLMs) are fine-tuned for generating responses to user queries. In this study, we develop a cybersecurity question-answering (Q&A) dataset, called AttackQA, and employ it to build a RAG-based Q&A system designed for analysts in security operations centers. The dataset comprises 25,335 Q&A pairs, accompanied by rationales to facilitate fine-tuning and evaluation. 80% of the dataset was generated with the help of a lightweight open-source LLM (Llama 3 8B), which produced over 1100 tokens per second with full 16-bit precision on SambaNova Systems' SN40L specialized hardware. To ensure dataset quality, we fine-tuned Llama 3 70B to detect and reject low-quality Q&A pairs. In using the dataset for RAG, we demonstrate that fine-tuning open-source embeddings and LLMs…
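The retrieve-then-generate flow that the abstract describes can be sketched in a few lines. The following is a minimal illustration only, not the paper's system: it assumes a toy bag-of-words similarity in place of the fine-tuned embeddings, the three documents are invented placeholders rather than AttackQA entries, and `retrieve` and `build_prompt` are hypothetical helper names. The generation step (sending the prompt to an LLM) is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, contexts):
    """Assemble the retrieved contexts and the question into one LLM prompt."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Placeholder knowledge base (invented for illustration, not from AttackQA).
docs = [
    "Phishing: adversaries send messages with malicious links or attachments to gain access.",
    "Mitigation for phishing: train users to recognize suspicious emails and filter malicious attachments.",
    "Process injection: adversaries inject code into running processes to evade defenses.",
]
query = "How can phishing emails be mitigated?"

top = retrieve(query, docs, k=2)
print(build_prompt(query, top))
```

In a real deployment the `embed` function would be replaced by a learned embedding model, and the prompt would be passed to the answering LLM; the ranking and prompt-assembly structure stays the same.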

Code & Models

Datasets

sambanovasystems/attackqa (dataset, 134 downloads)

Taxonomy

Topics: Software System Performance and Reliability · Big Data and Business Intelligence

Methods: Attention Is All You Need · Linear Layer · Softmax · Dropout · Dense Connections · Layer Normalization · Linear Warmup With Linear Decay · WordPiece · Adam