LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

Saad Ullah; Mingji Han; Saurabh Pujar; Hammond Pearce; Ayse Coskun; Gianluca Stringhini

arXiv:2312.12575·cs.CR·July 25, 2024·6 cites

TL;DR

This paper evaluates whether large language models can reliably identify and reason about security vulnerabilities, revealing significant non-robustness and inconsistency in their responses across various scenarios.

Contribution

The paper introduces SecLLMHolmes, a comprehensive evaluation framework and benchmark for assessing LLMs' ability to handle security-related code analysis tasks.

Findings

01 LLMs give non-deterministic responses and incorrect, unfaithful reasoning.

02 Even advanced models such as PaLM2 and GPT-4 are not robust to small code modifications, such as renamed functions or variables.

03 LLMs perform poorly in real-world security scenarios.
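
One simple way to quantify the non-determinism named in the first finding is to ask the same question several times and measure how often the answers agree. The sketch below is an illustration only, not the metric used by SecLLMHolmes: the function name `agreement_rate` and the list of verdicts are hypothetical, and the sample data is invented for the example rather than taken from the paper.

```python
from collections import Counter


def agreement_rate(verdicts):
    """Return the fraction of runs that match the most common verdict.

    `verdicts` is a list of labels (hypothetical here) collected by sending
    the same prompt to a model several times.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    # Count of the single most frequent label across all runs.
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count / len(verdicts)


# Invented example data: five runs of the same prompt on the same code.
runs = ["vulnerable", "safe", "vulnerable", "vulnerable", "safe"]
rate = agreement_rate(runs)
print(f"agreement rate: {rate:.2f}")
```

A rate of 1.0 means every run returned the same verdict; lower values indicate that the answer depends on sampling rather than on the code under analysis.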

Abstract

Large Language Models (LLMs) have been suggested for use in automated vulnerability repair, but benchmarks showing they can consistently identify security-related bugs are lacking. We thus develop SecLLMHolmes, a fully automated evaluation framework that performs the most detailed investigation to date on whether LLMs can reliably identify and reason about security-related bugs. We construct a set of 228 code scenarios and analyze eight of the most capable LLMs across eight different investigative dimensions using our framework. Our evaluation shows LLMs provide non-deterministic responses, incorrect and unfaithful reasoning, and perform poorly in real-world scenarios. Most importantly, our findings reveal significant non-robustness in even the most advanced models like 'PaLM2' and 'GPT-4': by merely changing function or variable names, or by the addition of library functions in the…
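
The robustness test described in the abstract, changing function or variable names without changing behavior, can be illustrated with a small sketch. This is not the SecLLMHolmes implementation: the `RenameIdentifiers` transformer, the `rename` helper, and the `toy_detector` function are hypothetical names introduced here, and `toy_detector` is a deliberately naive stand-in for a model that keys on surface names. The example assumes Python 3.9 or later for `ast.unparse`.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables, arguments, and function names per a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)  # also rename identifiers inside the body
        return node


def rename(source, mapping):
    """Return `source` with identifiers renamed; behavior is unchanged."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


def toy_detector(source):
    """Hypothetical stand-in for a model verdict that relies on surface names."""
    return "vulnerable" if "query" in source and "%" in source else "safe"


# A string-formatted SQL statement: the same flaw before and after renaming.
ORIGINAL = '''
def run_query(cursor, user):
    query = "SELECT * FROM t WHERE name = '%s'" % user
    cursor.execute(query)
'''

PERTURBED = rename(ORIGINAL, {"run_query": "f", "query": "s", "user": "u"})

print(toy_detector(ORIGINAL))   # verdict on the original code
print(toy_detector(PERTURBED))  # verdict after a semantics-preserving rename
```

Because the renaming leaves the program's behavior untouched, a robust detector should return the same verdict for both versions. The toy detector does not, which is the kind of inconsistency such a perturbation is meant to expose.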

Code & Models

Repositories

ai4cloudops/secllmholmes (Official)

Taxonomy

Topics: Topic Modeling