MetRex: A Benchmark for Verilog Code Metric Reasoning Using LLMs
Manar Abdelatty, Jingxiao Ma, Sherief Reda

TL;DR
This paper introduces MetRex, a large dataset and benchmark for evaluating LLMs' ability to reason about post-synthesis Verilog HDL metrics, demonstrating improved performance with fine-tuning and reasoning templates.
Contribution
The paper presents the first large-scale dataset and benchmark for LLM-based reasoning of Verilog post-synthesis metrics, along with methods to enhance reasoning accuracy.
Findings
Supervised Fine-Tuning (SFT) improves reasoning accuracy on average by 37.0%, 25.3%, and 25.7% for area, delay, and static power, respectively.
The approach predicts metrics for 17.4% more designs within 5% error.
The method offers a 1.7x speedup over traditional regression models.
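The "within 5% error" finding counts designs whose predicted metric lies within 5% relative error of the synthesized ground truth. Below is a minimal Python sketch of that accuracy measure. It assumes relative error is |predicted - actual| / |actual|, since the paper's exact definition is not given here. The function name and the sample values are hypothetical.

```python
from typing import Sequence


def fraction_within_tolerance(
    predicted: Sequence[float],
    actual: Sequence[float],
    tol: float = 0.05,
) -> float:
    """Fraction of designs whose prediction is within `tol` relative error.

    Hypothetical helper; assumes relative error |p - a| / |a|.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("need at least one design")
    hits = 0
    for p, a in zip(predicted, actual):
        if a == 0:
            # Relative error is undefined for a zero ground truth,
            # so count it as a hit only on an exact match.
            within = p == 0
        else:
            within = abs(p - a) / abs(a) <= tol
        hits += int(within)
    return hits / len(actual)


if __name__ == "__main__":
    # Hypothetical predicted vs. synthesized area values for four designs.
    pred = [100.0, 230.0, 50.0, 0.0]
    true = [102.0, 200.0, 40.0, 0.0]
    print(fraction_within_tolerance(pred, true))  # prints 0.5
```

With this helper, comparing two predictors on the same designs reduces to comparing their fractions. This summary does not say whether "17.4% more designs" is a percentage-point difference or a relative increase.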
Abstract
Large Language Models (LLMs) have been applied to various hardware design tasks, including Verilog code generation, EDA tool scripting, and RTL bug fixing. Despite this extensive exploration, LLMs are yet to be used for the task of post-synthesis metric reasoning and estimation of HDL designs. In this paper, we assess the ability of LLMs to reason about post-synthesis metrics of Verilog designs. We introduce MetRex, a large-scale dataset comprising 25,868 Verilog HDL designs and their corresponding post-synthesis metrics, namely area, delay, and static power. MetRex incorporates a Chain of Thought (CoT) template to enhance LLMs' reasoning about these metrics. Extensive experiments show that Supervised Fine-Tuning (SFT) boosts the LLM's reasoning capabilities on average by 37.0%, 25.3%, and 25.7% on the area, delay, and static power, respectively. While SFT improves performance on our…
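The abstract mentions a Chain of Thought (CoT) template but does not spell out its steps. One plausible step-by-step form for the area metric is sketched below in Python. It lists the synthesized standard cells, multiplies each cell count by that cell's library area, and sums the products, since post-synthesis area is the total area of the instantiated cells. This is an assumption made for illustration and may not match the MetRex template. The cell names, area values, and function name are all hypothetical and are not taken from any real standard-cell library.

```python
from typing import Dict, List, Mapping, Tuple

# Hypothetical cell areas (arbitrary units). A real flow would read these
# from the standard-cell library used during synthesis.
HYPOTHETICAL_CELL_AREA: Dict[str, float] = {
    "inv": 3.75,
    "nand2": 3.75,
    "xor2": 8.75,
    "dff": 20.0,
}


def area_reasoning_trace(
    cell_counts: Mapping[str, int],
    cell_area: Mapping[str, float] = HYPOTHETICAL_CELL_AREA,
) -> Tuple[float, List[str]]:
    """Sum count * area per cell type, recording each step as a text line.

    Hypothetical CoT-style trace; not the MetRex template itself.
    """
    steps: List[str] = []
    total = 0.0
    # Sort cell names so the trace is deterministic.
    for cell, count in sorted(cell_counts.items()):
        if cell not in cell_area:
            raise KeyError(f"no area known for cell {cell!r}")
        subtotal = count * cell_area[cell]
        steps.append(f"{count} x {cell} @ {cell_area[cell]} = {subtotal}")
        total += subtotal
    steps.append(f"Total area = {total}")
    return total, steps


if __name__ == "__main__":
    # Hypothetical cell counts for a small synthesized design.
    total, steps = area_reasoning_trace({"nand2": 4, "inv": 2, "dff": 3})
    print("\n".join(steps))
```

Writing each intermediate product out as text mirrors the idea behind a CoT template: the model states the sub-results explicitly instead of emitting a single number.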
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Shrink and Fine-Tune
