AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence

Md Tanvirul Alam; Dipkamal Bhusal; Salman Ahmad; Nidhi Rastogi; Peter Worth

arXiv:2511.01144 · cs.CR · February 17, 2026

Open Access

TL;DR

AthenaBench is a benchmark for evaluating how well large language models perform cyber threat intelligence (CTI) tasks. Its results expose limitations in the reasoning abilities of current models and point to the need for models specialized for CTI.

Contribution

This work extends the CTIBench benchmark into AthenaBench, adding an improved dataset creation pipeline, duplicate removal, refined evaluation metrics, and a new task on risk mitigation strategies. The result is a broader evaluation framework for LLMs in CTI.

Findings

01 Proprietary LLMs outperform open-source models overall.

02 Performance drops significantly on reasoning-intensive tasks.

03 Current LLMs have fundamental limitations in CTI reasoning capabilities.

Abstract

Large Language Models (LLMs) have demonstrated strong capabilities in natural language reasoning, yet their application to Cyber Threat Intelligence (CTI) remains limited. CTI analysis involves distilling large volumes of unstructured reports into actionable knowledge, a process where LLMs could substantially reduce analyst workload. CTIBench introduced a comprehensive benchmark for evaluating LLMs across multiple CTI tasks. In this work, we extend CTIBench by developing AthenaBench, an enhanced benchmark that includes an improved dataset creation pipeline, duplicate removal, refined evaluation metrics, and a new task focused on risk mitigation strategies. We evaluate twelve LLMs, including state-of-the-art proprietary models such as GPT-5 and Gemini-2.5 Pro, alongside seven open-source models from the LLaMA and Qwen families. While proprietary LLMs achieve stronger results overall,…

Taxonomy

Topics: Misinformation and Its Impacts · Intelligence, Security, War Strategy · Topic Modeling