AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence
Md Tanvirul Alam, Dipkamal Bhusal, Salman Ahmad, Nidhi Rastogi, Peter Worth

TL;DR
AthenaBench is a benchmark for evaluating how well large language models perform cyber threat intelligence tasks. Its results expose current limitations in reasoning and underscore the need for specialized models.
Contribution
This work extends the CTIBench benchmark into AthenaBench, adding an improved dataset creation pipeline, duplicate removal, refined evaluation metrics, and a new task on risk mitigation strategies, to provide a more comprehensive evaluation framework for LLMs in CTI.
Findings
Proprietary LLMs outperform open-source models overall.
Performance drops significantly on reasoning-intensive tasks.
Current LLMs have fundamental limitations in CTI reasoning capabilities.
Abstract
Large Language Models (LLMs) have demonstrated strong capabilities in natural language reasoning, yet their application to Cyber Threat Intelligence (CTI) remains limited. CTI analysis involves distilling large volumes of unstructured reports into actionable knowledge, a process where LLMs could substantially reduce analyst workload. CTIBench introduced a comprehensive benchmark for evaluating LLMs across multiple CTI tasks. In this work, we extend CTIBench by developing AthenaBench, an enhanced benchmark that includes an improved dataset creation pipeline, duplicate removal, refined evaluation metrics, and a new task focused on risk mitigation strategies. We evaluate twelve LLMs, including state-of-the-art proprietary models such as GPT-5 and Gemini-2.5 Pro, alongside seven open-source models from the LLaMA and Qwen families. While proprietary LLMs achieve stronger results overall,…
Taxonomy
Topics: Misinformation and Its Impacts · Intelligence, Security, War Strategy · Topic Modeling
