TL;DR
CyberCertBench is a new benchmark suite for evaluating LLMs' cybersecurity certification knowledge. The paper also introduces a Proposer-Verifier framework for interpreting model performance and analyzes results across standards and model scales.
Contribution
It presents CyberCertBench, an MCQA benchmark suite built from industry-recognized cybersecurity certifications, and proposes a Proposer-Verifier framework that generates natural language explanations for model performance.
Findings
Frontier models reach human expert level in general cybersecurity knowledge.
Model accuracy drops on vendor-specific and formal standards questions.
Scaling trends show diminishing returns for larger models.
Abstract
The rapid evolution and use of Large Language Models (LLMs) in professional workflows require an evaluation of their domain-specific knowledge against industry standards. We introduce CyberCertBench, a new suite of Multiple Choice Question Answering (MCQA) benchmarks derived from industry recognized certifications. CyberCertBench evaluates LLM domain knowledge against the professional standards of Information Technology cybersecurity and more specialized areas such as Operational Technology and related cybersecurity standards. Concurrently, we propose and validate a novel Proposer-Verifier framework, a methodology to generate interpretable, natural language explanations for model performance. Our evaluation shows that frontier models achieve human expert level in general networking and IT security knowledge. However, their accuracy declines in questions that require vendor-specific nuances or…
