OllaBench: Evaluating LLMs' Reasoning for Human-centric Interdependent Cybersecurity
Tam N. Nguyen

TL;DR
OllaBench is a new evaluation framework for assessing LLMs' reasoning in human-centric interdependent cybersecurity, focusing on accuracy, wastefulness, and consistency, with insights from cognitive theories and empirical data.
Contribution
The paper introduces OllaBench, a comprehensive, theory-based evaluation framework for LLMs in cybersecurity, addressing human factors often overlooked in prior assessments.
Findings
Commercial LLMs achieve higher accuracy but still have room for improvement.
Open-weight LLMs perform competitively in some aspects, especially in efficiency.
Significant differences exist among models in token efficiency and consistency.
Abstract
Large Language Models (LLMs) have the potential to enhance Agent-Based Modeling by better representing complex interdependent cybersecurity systems, improving cybersecurity threat modeling and risk management. However, evaluating LLMs in this context is crucial for legal compliance and effective application development. Existing LLM evaluation frameworks often overlook the human factor and cognitive computing capabilities essential for interdependent cybersecurity. To address this gap, I propose OllaBench, a novel evaluation framework that assesses LLMs' accuracy, wastefulness, and consistency in answering scenario-based information security compliance and non-compliance questions. OllaBench is built on a foundation of 24 cognitive behavioral theories and empirical evidence from 38 peer-reviewed papers. OllaBench was used to evaluate 21 LLMs, including both open-weight and commercial…
Taxonomy
Topics: Information and Cyber Security · Cybersecurity and Cyber Warfare Studies
