AMSbench: A Comprehensive Benchmark for Evaluating MLLM Capabilities in AMS Circuits
Yichen Shi, Ze Zhang, Hongyang Wang, Zhuofu Tao, Zhongyi Li, Bingyu Chen, Yaxin Wang, Zhen Huang, Xuhua Liu, Quan Chen, Zhiping Yu, Ting-Jung Lin, Lei He

TL;DR
This paper introduces AMSbench, a comprehensive benchmark suite for evaluating Multi-modal Large Language Models' capabilities in analog/mixed-signal circuit analysis and design, revealing current limitations and guiding future improvements.
Contribution
The paper presents AMSbench, the first extensive benchmark for assessing MLLMs across diverse AMS circuit tasks, including perception, analysis, and design, with around 8000 test questions.
Findings
Current MLLMs show significant limitations in complex circuit reasoning.
Models perform poorly on multi-modal and sophisticated design tasks.
Benchmark results highlight the need to advance MLLMs in the AMS domain.
Abstract
Analog/Mixed-Signal (AMS) circuits play a critical role in the integrated circuit (IC) industry. However, automating AMS circuit design has remained a longstanding challenge due to its difficulty and complexity. Although recent advances in Multi-modal Large Language Models (MLLMs) offer promising potential for supporting AMS circuit analysis and design, current research typically evaluates MLLMs on isolated tasks within the domain, lacking a comprehensive benchmark that systematically assesses model capabilities across diverse AMS-related challenges. To address this gap, we introduce AMSbench, a benchmark suite designed to evaluate MLLM performance across critical tasks including circuit schematic perception, circuit analysis, and circuit design. AMSbench comprises approximately 8000 test questions spanning multiple difficulty levels and assesses eight prominent…
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. Coverage is comprehensive. This work covers AMS in depth, providing 8674 question-answer pairs spanning three core tasks (schematic perception, circuit analysis, and circuit design), which are further broken down into 18 distinct sub-tasks. 2. The generative evaluation is interesting to see, and it highlights a key drawback of MLLMs on AMS tasks (although the relative number of design questions is small: 68). 3. The paper is easy to follow.
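To make the remark about the small design subset concrete, here is a minimal sketch of how sample size affects the uncertainty of a measured accuracy. It uses the standard Wilson score interval for a binomial proportion. The 68-question count comes from the review; the 50% accuracy, the 6000-question comparison size, and the function name `wilson_interval` are illustrative assumptions, not values or methods reported by the paper.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# 68 is the design-question count cited in the review.
# 6000 and the 50% accuracy are hypothetical values chosen for comparison.
for n in (68, 6000):
    low, high = wilson_interval(n // 2, n)
    print(f"n={n}: 95% interval = [{low:.3f}, {high:.3f}], width = {high - low:.3f}")
```

Under these assumptions the interval for the 68-question subset is far wider than for the larger subset, so small accuracy differences between models on the design tasks would be hard to distinguish from noise.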
1. The topic has been addressed before. This work seems very similar to [1], which also benchmarks MLLMs on electrical circuits and related topics and has been published at a prior conference. That work also reached similar conclusions (MLLMs perform poorly, the domain gap in pretraining could be one cause, etc.); however, it is not mentioned in this manuscript. The authors are encouraged to discuss this work and clarify how AMSbench differs from it. 2. Benchmark difficulty is in question.
- Comprehensive Benchmark Scope and Design: AMSbench is the first end-to-end evaluation pipeline that includes schematic perception, circuit analysis, and design reasoning all in one benchmark. Prior research (e.g., MMCircuitEval, AnalogCoder) was primarily concerned with symbolic circuit reasoning or schematic captioning, but AMSbench expressly combines multimodal visual and textual information. Because of its broad reach, it is ideally positioned to assess both low-level perception (component
- Limited novelty in methodology: The benchmark-building workflow is mostly based on existing dataset production paradigms such as template-driven QA, schematic parsing, and human verification. The key contribution is domain adaptation to AMS, not a methodological leap in benchmark construction. - Imbalanced task distribution: The perception subset accounts for around 6,000 samples, but the design and testbench generation subsets are much smaller (~hundreds). This mismatch reduces statistical confi
- This work focuses on a meaningful direction: building a benchmark for AMS circuits, which may have great impact on future research. - The design of the benchmark is based on a deep and comprehensive understanding of circuit design. - There are extensive experiments, which offer many insights to the audience.
- What is the error rate in the proposed benchmark? As mentioned, the questions and answers are created by humans and AI models. Could the benchmark contain errors (e.g., an incorrect answer or an unreasonable question)? Is there a mechanism in the data curation process to control the error rate? - In lines 213 to 214, it is mentioned that the benchmark includes textual question answering (TQA). In Section 4.2, the metrics considered include ACC, F1, NED and pass@k. It seems
1. The benchmark is comprehensive and well-structured. 2. The paper is clearly written and easy to follow. 3. Analog/Mixed-Signal (AMS) circuits understanding for MLLMs is an important problem.
1. Missing related work: The paper does not discuss EEE-Bench [1], a comprehensive multimodal electrical and electronics engineering benchmark that also focuses heavily on circuit analysis. 2. Overstated claims and unclear novelty: The authors claim that AMSbench is the first holistic benchmark for systematically evaluating MLLM performance in AMS circuits. However, prior works such as EEE-Bench and MMCircuitEval have already conducted extensive studies in this domain. The paper should clearly
Taxonomy
Topics: Medical Imaging and Pathology Studies · Superconducting Materials and Applications
