Automated Consistency Analysis of LLMs
Aditya Patwardhan, Vivek Vaidya, Ashish Kundu

TL;DR
This paper defines and evaluates the consistency of responses from large language models in cybersecurity, revealing that current models often produce inconsistent answers, which impacts their trustworthiness in critical applications.
Contribution
It introduces a formal definition of LLM response consistency and develops a framework for evaluating it through self-validation and cross-model validation.
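The two validation approaches can be illustrated with a minimal sketch. It is not the paper's formal definition or framework, which this summary does not reproduce. It assumes each model is wrapped as a plain function from prompt to answer string, and it uses exact match after whitespace and case normalization as a stand-in for whatever agreement measure the authors use. All names here (`normalize`, `pairwise_agreement`, `self_consistency`, `cross_model_consistency`) are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, List

# Hypothetical type: any callable that maps a prompt to a model's answer.
AskFn = Callable[[str], str]


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def pairwise_agreement(answers: List[str]) -> float:
    """Fraction of answer pairs that match after normalization (1.0 = fully consistent).

    Assumption: exact match is a crude proxy; a semantic similarity
    measure could be substituted here.
    """
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # zero or one answer cannot disagree with itself
    matches = sum(normalize(a) == normalize(b) for a, b in pairs)
    return matches / len(pairs)


def self_consistency(ask: AskFn, prompt: str, n: int = 5) -> float:
    """Self-validation sketch: ask the same model the same prompt n times."""
    return pairwise_agreement([ask(prompt) for _ in range(n)])


def cross_model_consistency(models: Dict[str, AskFn], prompt: str) -> float:
    """Cross-model sketch: ask several models the same prompt once each."""
    return pairwise_agreement([ask(prompt) for ask in models.values()])
```

In use, `ask` would wrap a real API client; a score well below 1.0 on a factual cybersecurity prompt would flag the kind of inconsistency the paper reports.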
Findings
LLMs often produce inconsistent responses in cybersecurity tasks.
The proposed framework effectively measures LLM response consistency.
Experiments show inconsistency issues across multiple popular LLMs.
Abstract
Generative AI (Gen AI) with large language models (LLMs) is being widely adopted across industry, academia, and government. Cybersecurity is one of the key sectors where LLMs can be, or already are being, used. A number of problems inhibit the adoption of trustworthy Gen AI and LLMs in cybersecurity and other critical areas. One of the key challenges to the trustworthiness and reliability of LLMs is how consistent an LLM is in its responses. In this paper, we analyze the consistency of LLM responses, develop a formal definition of it, and then develop a framework for consistency evaluation. The paper proposes two approaches to validate consistency: self-validation and validation across multiple LLMs. We have carried out extensive experiments for several LLMs such as GPT4oMini, GPT3.5, Gemini,…
Taxonomy
Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis · Service-Oriented Architecture and Web Services
