EmbedAgent: Benchmarking Large Language Models in Embedded System Development

Ruiyang Xu; Jialun Cao; Mingyuan Wu; Wenliang Zhong; Yaojie Lu; Ben He; Xianpei Han; Shing-Chi Cheung; Le Sun

arXiv:2506.11003·cs.SE·January 26, 2026

TL;DR

This paper introduces EmbedAgent and Embedbench to evaluate large language models in embedded system development, revealing current limitations and proposing strategies to enhance their performance in real-world tasks.

Contribution

The paper presents EmbedAgent, a paradigm that simulates real-world embedded development roles, and Embedbench, a benchmark for assessing LLMs on embedded system tasks. It also proposes retrieval and feedback strategies to improve their performance.

Findings

01 DeepSeek-R1 achieves 55.6% pass@1 when given schematic information

02 MicroPython on Raspberry Pi Pico reaches 73.8% pass@1

03 The proposed strategies raise DeepSeek-R1 to 65.1% pass@1 and migration accuracy to 27.8%
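
The findings above are reported as pass@1. As a rough illustration of what that metric measures, the sketch below uses the widely cited unbiased pass@k estimator (n samples per task, c of them passing). This is an assumption about the scoring, since this page does not state how the paper computes its numbers; the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical and not taken from the paper's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of generated samples, c: number of samples that passed,
    k: sample budget being evaluated. For k == 1 this reduces to c / n.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average the per-task estimate over (n, c) pairs, one pair per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical results: four tasks, one sample each, three passing.
    demo = [(1, 1), (1, 0), (1, 1), (1, 1)]
    print(f"pass@1 = {benchmark_pass_at_k(demo, k=1):.3f}")
```

With a single sample per task, pass@1 is simply the fraction of tasks whose one generated solution passes, which is how a figure such as 55.6% would be read under this assumption.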

Abstract

Large Language Models (LLMs) have shown promise in various tasks, yet few benchmarks assess their capabilities in embedded system development. In this paper, we introduce EmbedAgent, a paradigm designed to simulate real-world roles in embedded system development, such as Embedded System Programmer, Architect, and Integrator. This paradigm enables LLMs to be tested in tasks that bridge the gap between digital and physical systems, allowing for a more comprehensive assessment of their capabilities. To evaluate LLMs on these tasks, we propose Embedbench, the first comprehensive benchmark for embedded system programming, circuit design, and cross-platform migration. Embedbench consists of 126 cases, covering 9 electronic components across 3 hardware platforms. Through extensive experiments on 10 mainstream LLMs, we uncover several key findings. Surprisingly, despite the simplicity of the…

Code & Models

Datasets

xhwl/EmbedBench
dataset · 12 dl

Taxonomy

Topics: Model-Driven Software Engineering Techniques