PCEval: A Benchmark for Evaluating Physical Computing Capabilities of Large Language Models
Inpyo Song, Eunji Jeon, Jangwon Lee

TL;DR
PCEval is a novel benchmark that automatically evaluates large language models' abilities in physical computing tasks, including circuit design and code generation, revealing strengths in logical reasoning but challenges in physical layout creation.
Contribution
Introduces PCEval, the first fully automatic benchmark for assessing LLMs' physical computing capabilities in both logical and physical aspects.
Findings
LLMs excel in code generation and logical circuit design.
LLMs struggle with physical breadboard layout and pin connection management.
PCEval provides a reproducible framework for evaluating LLMs in hardware-related tasks.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, including software development, education, and technical assistance. Among these, software development is one of the key areas where LLMs are increasingly adopted. However, when hardware constraints are considered (for instance, in physical computing, where software must interact with and control physical hardware), their effectiveness has not been fully explored. To address this gap, we introduce PCEval (Physical Computing Evaluation), the first benchmark in physical computing that enables a fully automatic evaluation of the capabilities of LLMs in both the logical and physical aspects of projects, without requiring human assessment. Our evaluation framework assesses LLMs in generating circuits and producing compatible code across varying levels of project complexity. Through…
Peer Reviews
Decision·Submitted to ICLR 2026
- Novel Evaluation Dimension: The paper precisely identifies the missing capability in current LLM evaluation: the ability to reason about and execute tasks that require physical computing.
- Reproducible Evaluation Pipeline: The benchmark's use of fully automated simulation (Wokwi) ensures objective, quantitative validation of the generated circuits and code. This contrasts with previous work (e.g., EmbedTask, MICRO-25), which relied on subjective human grading or partial execution tests. The in
- Dataset's Scale and Diversity Limitations: With only 50 projects, PCEval's coverage is limited when compared to large-scale code or reasoning benchmarks. The tasks are primarily for introductory Arduino applications (LEDs, sensors, and servos), with less emphasis on other topics such as real-time signal processing.
- Limited Comparison: Lack of comparisons with classical or hybrid design-automation systems (e.g., symbolic circuit solvers, search-based algorithms). This makes it hard for the audi
1. The investigated question is novel and interesting: how good LLMs are at physical computing for educational purposes. It is well motivated. With the development of LLMs in many fundamental tasks such as reasoning, it is important to understand how useful they are in real-life tasks.
2. The study design is carefully constructed. It starts by interviewing multiple CS educators about what problems are critical in their context, making sure that the investigated problems are relevant to real-wor
Mitigation methods could be discussed more thoroughly. For instance, why does CoT work on some models but not others? Are there methods for improving model performance beyond prompting?
(1) Novel and well-motivated benchmark addressing a previously unexamined domain. Clear task decomposition and a fully automated evaluation framework ensure reproducibility.
(1) Evaluations lack variance measures, such as pass@k metrics. LLM results can be noisy under temperature sampling. (2) The benchmark's scope is largely Arduino-centric. This raises questions about the generality and difficulty of the tasks in real-world use cases (beyond educational purposes). (3) It seems the LLMs consistently make mistakes that are easy to correct (i.e., physical constraint violations). Would incorporating such feedback in an agentic framework improve the res
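For readers unfamiliar with the pass@k metric mentioned in point (1), the following is a minimal sketch of the unbiased estimator commonly used in code-generation evaluation. It is not taken from PCEval; the function names and the parameters `n` (samples drawn per task), `c` (samples that pass the automatic check), and `k` are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    Assumed inputs (hypothetical names, not from the paper):
      n: total samples generated for one task
      c: number of those samples that passed the automatic check
      k: sampling budget being scored, with 1 <= k <= n
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples fail) = 1 - C(n - c, k) / C(n, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, given (n, c) pairs per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Reporting this value at several settings of `k`, together with the spread across repeated runs, is one way to address the noise from temperature sampling that the reviewer raises.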
Taxonomy
Topics: Machine Learning in Materials Science · Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques
