MARVEL: Multi-Agent RTL Vulnerability Extraction using Large Language Models

Luca Collini; Baleegh Ahmad; Joey Ah-kiow; Ramesh Karri

arXiv:2505.11963·cs.CR·February 25, 2026

Open Access · 3 Reviews

TL;DR

MARVEL is a multi-agent LLM framework that automates hardware security vulnerability detection in RTL code by mimicking a designer's decision process, integrating various tools and reasoning strategies.

Contribution

It introduces a novel multi-agent LLM system for unified decision-making and tool use in RTL security verification, enhancing detection accuracy and reasoning capabilities.

Findings

01

Detected 19 valid security vulnerabilities in a buggy SoC

02

Identified 14 concrete warnings, demonstrating practical utility

03

Produced 18 hallucinated reports, highlighting challenges in LLM reliability
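
The three counts above can be tallied into shares of the total output. The sketch below assumes the three categories partition all reports, which is consistent with the 19/51 figure quoted in the reviews; the dictionary keys are illustrative labels, not the paper's terminology.

```python
# Counts taken from the Findings list; the key names are illustrative only.
counts = {"valid": 19, "warning": 14, "hallucinated": 18}

# Assumption: the three categories together cover every report.
total = sum(counts.values())

# Share of each category, rounded to three decimals.
shares = {name: round(n / total, 3) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {counts[name]}/{total} = {share:.1%}")
```

Under that assumption, valid vulnerabilities make up roughly 37% of all reports.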

Abstract

Hardware security verification is a challenging and time-consuming task. Design engineers may use formal verification, linting, and functional simulation tests, coupled with analysis and a deep understanding of the hardware design being inspected. Large Language Models (LLMs) have been used to assist during this task, either directly or in conjunction with existing tools. We improve the state of the art by proposing MARVEL, a multi-agent LLM framework for a unified approach to decision-making, tool use, and reasoning. MARVEL mimics the cognitive process of a designer looking for security vulnerabilities in RTL code. It consists of a supervisor agent that devises the security policy of the system-on-chips (SoCs) using its security documentation. It delegates tasks to validate the security policy to individual executor agents. Each executor agent carries out its assigned task using a…
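
The supervisor-and-executor delegation described in the abstract can be pictured with a minimal sketch. Everything below is hypothetical: the names `Task`, `Finding`, `supervisor_plan`, `keyword_executor`, and `run` are not from the paper, no LLM is called, and the executor is a toy string check standing in for whatever each real executor agent does.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    policy_item: str  # one security-policy statement to validate
    executor: str     # name of the executor that should check it


@dataclass
class Finding:
    task: Task
    flagged: bool     # True if the executor considers the item violated
    note: str


def supervisor_plan(security_doc: str) -> List[Task]:
    """Hypothetical supervisor: turn each non-empty doc line into one task."""
    return [
        Task(policy_item=line.strip(), executor="keyword_check")
        for line in security_doc.splitlines()
        if line.strip()
    ]


def keyword_executor(task: Task, rtl: str) -> Finding:
    """Toy executor: flag the task if the RTL never mentions the item's first word."""
    key = task.policy_item.split()[0].lower()
    present = key in rtl.lower()
    status = "found" if present else "missing"
    return Finding(task, flagged=not present, note=f"'{key}' {status} in RTL")


# Registry mapping executor names to callables (assumed structure).
EXECUTORS: Dict[str, Callable[[Task, str], Finding]] = {
    "keyword_check": keyword_executor,
}


def run(security_doc: str, rtl: str) -> List[Finding]:
    """Supervisor loop: plan tasks, delegate each one, collect the findings."""
    return [EXECUTORS[t.executor](t, rtl) for t in supervisor_plan(security_doc)]


findings = run(
    "lock register must be write-protected\nreset clears key",
    "module m; reg lock; endmodule",
)
for f in findings:
    print(f.task.policy_item, "->", "FLAG" if f.flagged else "ok", f.note)
```

The registry keeps the supervisor independent of how any single executor works, so a different checker could be registered under a new name without touching the planning step.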

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

(1) This idea is straightforward and can be easily understood. (2) The experiments explore multiple GPT models for detecting hardware vulnerabilities, which is interesting.

Weaknesses

(1) This paper focuses on the use of Large Language Models (LLMs) for hardware security; however, it remains unclear how the authors specifically leverage these models for vulnerability detection. The methodology section lacks sufficient detail on how the LLMs are integrated into the detection pipeline, what kind of data or prompts are used, and how the outputs are analyzed or validated. (2) My other concern about this paper is its novelty. The paper appears to directly apply existing

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. This work claims to be the first comprehensive multi-agent approach for RTL security verification. 2. The paper addresses hardware security verification. The approach handles hardware-specific challenges including clocking, concurrency, hardware CWEs, and FSM properties, and is evaluated on a real-world Hack@DATE benchmark. 3. MARVEL demonstrates solid engineering through practical integration of industry-standard EDA tools (VC SpyGlass Lint, VC Formal, Verilator), iterative refinement fo

Weaknesses

1. Table 2 presents security issues classified as 'correct' or 'incorrect,' but the paper never specifies who made these determinations. If the authors themselves judged their system's outputs, this constitutes circular reasoning and lacks objectivity. The absence of inter-rater reliability metrics, independent expert validation, or comparison with the official Hack@DATE answer key fundamentally undermines the credibility of the paper's central claims. 2. The evaluation is fundamentally incompl

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The design of MARVEL is clear and intuitive. Agents and toolchains are concretely described, including their prompts, actions, and failure-recovery loops. 2. MARVEL is fully automated and useful for real RTL workflows and complementary to human verification. 3. Evaluation shows that MARVEL can detect security vulnerabilities with a 19/51 TP ratio.

Weaknesses

1. The novelty of this work should be argued more precisely. For example, the paper claims that MARVEL is the "first" multi-agent framework for hardware bug detection. However, SV-LLM is also multi-agent, albeit with weaker agent integration. The authors should clarify the novelty of MARVEL relative to existing studies in terms of what is technically new. 2. The evaluation relies on a non-public SoC, and the full ground-truth bug list is not available. This limits community reproducibility and p

Taxonomy

Topics: Security and Verification in Computing · Formal Methods in Verification · Physical Unclonable Functions (PUFs) and Hardware Security