PCEval: A Benchmark for Evaluating Physical Computing Capabilities of Large Language Models

Inpyo Song; Eunji Jeon; Jangwon Lee

arXiv:2601.02404·cs.CL·January 7, 2026

Open Access · 3 Reviews

TL;DR

PCEval is a novel benchmark that automatically evaluates large language models' abilities in physical computing tasks, including circuit design and code generation, revealing strengths in logical reasoning but challenges in physical layout creation.

Contribution

Introduces PCEval, the first fully automatic benchmark for assessing LLMs' physical computing capabilities in both logical and physical aspects.

Findings

01

LLMs excel in code generation and logical circuit design.

02

LLMs struggle with physical breadboard layout and pin connection management.

03

PCEval provides a reproducible framework for evaluating LLMs in hardware-related tasks.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, including software development, education, and technical assistance. Among these, software development is one of the key areas where LLMs are increasingly adopted. However, when hardware constraints are considered (for instance, in physical computing, where software must interact with and control physical hardware), their effectiveness has not been fully explored. To address this gap, we introduce PCEval (Physical Computing Evaluation), the first benchmark in physical computing that enables fully automatic evaluation of LLM capabilities in both the logical and physical aspects of projects, without requiring human assessment. Our evaluation framework assesses LLMs in generating circuits and producing compatible code across varying levels of project complexity. Through…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 2

Strengths

- Novel Evaluation Dimension: The paper precisely identifies the missing capability in current LLM evaluation: the ability to reason about and execute tasks that require physical computing.
- Reproducible Evaluation Pipeline: The benchmark's use of fully automated simulation (Wokwi) ensures objective, quantitative validation of the generated circuits and code. This contrasts with previous work (e.g., EmbedTask, MICRO-25), which relied on subjective human grading or partial execution tests. The in

Weaknesses

- Dataset's Scale and Diversity Limitations: With only 50 projects, PCEval's coverage is limited when compared to large-scale code or reasoning benchmarks. The tasks are primarily for introductory Arduino applications (LEDs, sensors, and servos), with less emphasis on other topics such as real-time signal processing.
- Limited Comparison: Lack of comparisons with classical or hybrid design-automation systems (e.g., symbolic circuit solvers, search-based algorithms). This makes it hard for the audi

Reviewer 02 · Rating 8 · Confidence 2

Strengths

1. The investigated question is novel and interesting: how well LLMs perform in physical computing for educational purposes. It is well motivated. With the development of LLMs in many fundamental tasks such as reasoning, it is important to understand how useful they are in real-life tasks.
2. The study design is carefully constructed. It starts by interviewing multiple CS educators about what problems are critical in their context, making sure that the investigated problems are relevant to real-wor

Weaknesses

Mitigation methods could be discussed more thoroughly. For instance, why does CoT work on some models but not others? Are there methods for improving model performance beyond prompting alone?

Reviewer 03 · Rating 4 · Confidence 4

Strengths

(1) Novel and well-motivated benchmark addressing a previously unexamined domain. Clear task decomposition and a fully automated evaluation framework that ensures reproducibility.

Weaknesses

(1) Evaluations lack variance measures, such as pass@k metrics. LLM results could be noisy under temperature sampling.
(2) The benchmark's scope is largely Arduino-centric. This raises questions about the generality and difficulty of the tasks in real-world use cases (other than for educational purposes).
(3) It seems the LLMs consistently make mistakes on problems that are easy to correct (i.e., physical constraint violations). Would incorporating such feedback in an agentic framework improve the res
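For readers unfamiliar with the pass@k metric mentioned in point (1), the following is a minimal sketch of the unbiased estimator commonly used in code-generation benchmarks: with n samples per task of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). This is an illustration only; the source does not say that PCEval uses this estimator, and the function names and the sample counts in the usage comment are hypothetical.

```python
from math import comb
from statistics import mean, stdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: samples drawn for the task, c: samples that passed, k: budget.
    Returns the probability that at least one of k samples, drawn
    without replacement from the n, is a passing one.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def summarize(results: list[tuple[int, int]], k: int) -> tuple[float, float]:
    """Mean and sample standard deviation of per-task pass@k.

    results holds one (n, c) pair per task (hypothetical input format).
    """
    scores = [pass_at_k(n, c, k) for n, c in results]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


# Hypothetical usage: three tasks, 10 samples each, with 3, 0 and 10 passes.
# avg, spread = summarize([(10, 3), (10, 0), (10, 10)], k=1)
```

Reporting the spread alongside the mean is one way to expose the sampling noise the reviewer describes; repeated runs with different seeds would be another.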

Taxonomy

Topics: Machine Learning in Materials Science · Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques