QuArch: A Benchmark for Evaluating LLM Reasoning in Computer Architecture

Shvetank Prakash; Andrew Cheng; Arya Tschand; Mark Mazumder; Varun Gohil; Jeffrey Ma; Jason Yik; Zishen Wan; Jessica Quaye; Elisavet Lydia Alvanaki; Avinash Kumar; Chandrashis Mazumdar; Tuhin Khare; Alexander Ingare; Ikechukwu Uchendu; Radhika Ghosal; Abhishek Tyagi; Chenyu Wang; Andrea Mattia Garavagno; Sarah Gu; Alice Guo; Grace Hur; Luca Carloni; Tushar Krishna; Ankita Nayak; Amir Yazdanbakhsh; Vijay Janapa Reddi

arXiv:2510.22087·cs.AR·October 28, 2025

TL;DR

QuArch is a comprehensive benchmark with 2,671 expert-validated questions designed to evaluate large language models' reasoning and knowledge in computer architecture, revealing significant gaps in advanced understanding.

Contribution

This paper introduces QuArch, the first benchmark specifically targeting LLM reasoning in computer architecture, filling a gap in current evaluation methods.

Findings

01. LLMs have domain knowledge but struggle with higher-order reasoning.

02. Model accuracy varies from 34% to 72% on advanced questions.

03. QuArch provides a foundation for improving LLM capabilities in architecture.

Abstract

The field of computer architecture, which bridges high-level software abstractions and low-level hardware implementations, remains absent from current large language model (LLM) evaluations. To this end, we present QuArch (pronounced 'quark'), the first benchmark designed to facilitate the development and evaluation of LLM knowledge and reasoning capabilities specifically in computer architecture. QuArch provides a comprehensive collection of 2,671 expert-validated question-answer (QA) pairs covering various aspects of computer architecture, including processor design, memory systems, and interconnection networks. Our evaluation reveals that while frontier models possess domain-specific knowledge, they struggle with skills that require higher-order thinking in computer architecture. Frontier model accuracies vary widely (from 34% to 72%) on these advanced questions, highlighting…
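As a rough illustration of how accuracy figures like those quoted above could be tallied per skill category, here is a minimal sketch. The record layout (`skill`, `predicted`, `answer`), the skill labels, and exact-match grading are all assumptions made for this example; they are not taken from the QuArch paper or its evaluation code.

```python
from collections import defaultdict


def accuracy_by_skill(records):
    """Return {skill: fraction correct} using exact-match grading.

    `records` is an iterable of dicts with hypothetical keys
    'skill', 'predicted', and 'answer' (not QuArch's actual schema).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["skill"]] += 1
        if r["predicted"] == r["answer"]:
            correct[r["skill"]] += 1
    return {skill: correct[skill] / total[skill] for skill in total}


# Toy, made-up records purely for illustration (not QuArch data).
toy = [
    {"skill": "recall", "predicted": "B", "answer": "B"},
    {"skill": "recall", "predicted": "A", "answer": "A"},
    {"skill": "reasoning", "predicted": "C", "answer": "D"},
    {"skill": "reasoning", "predicted": "A", "answer": "A"},
]

print(accuracy_by_skill(toy))
```

Reporting accuracy per skill rather than as one aggregate keeps a gap between recall-style and reasoning-style questions visible instead of averaging it away.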

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

1. This work addresses the lack of LLM evaluation benchmarks in computer architecture. The inclusion of expert validation for all 2,671 question-answer pairs lends credibility to the dataset's quality and technical accuracy.

2. The authors evaluate a wide range of frontier LLMs.

Weaknesses

The "Implement" skill is tested by asking models to produce artifacts like code or simulation scripts. This is a step in the right direction, but it's a simplified version of real-world implementation, which involves complex toolchains, debugging, verification, and performance tuning that cannot be fully captured in this format.

Reviewer 02·Rating 6·Confidence 2

Strengths

1. The investigated problem is interesting: there are many datasets on software design, but it is also important to understand whether we can use LLMs in hardware design.

2. The dataset consists of more than 2k QA pairs validated by experts, providing a large-scale testbed of high-quality data instances for researchers to understand and improve model behavior.

3. The paper provides interesting insights into LLMs' common behavior patterns on this benchmark. These insights make it possible to

Weaknesses

1. The headroom of the benchmark does not seem very large. GPT-5 already achieves accuracy above 70 on the reasoning task and around 90 on the recall task, so even if model performance can be improved, the room for improvement is quite small. Given that recent benchmarks tend to saturate quickly as models become more powerful, I am somewhat worried about how long the benchmark will remain useful.

2. Is there any method we can use to improve LLM performance in re

Reviewer 03·Rating 2·Confidence 5

Strengths

1) This work presents an interesting benchmark for reasoning in computer architecture.

2) This benchmark helps to demonstrate the shortcomings of LLMs in higher-order thinking.

Weaknesses

1) Judging by the sources of the benchmark, there is limited evidence that the QA tasks reflect the iterative, data-driven decision-making and trade-off analysis that human architects perform. This limits ecological validity, i.e., how well the benchmark reflects real-world reasoning in architectural design.

2) Despite the efforts, the benchmark’s scope is restricted to a relatively small set of expert-validated questions and may not scale to cover the full breadth of computer architectural
