EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design
Gioele Molinari, Florian Felten, Soheyl Massoudi, Mark Fuge

TL;DR
This paper introduces a benchmark suite for evaluating LLM-driven engineering design across three dimensions (workflows, retrieval, and HPC), along with EngiAI, a reference multi-agent system implementation built on LangGraph.
Contribution
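It presents a benchmark suite with three evaluation dimensions (workflow, RAG, and HPC) and a multi-agent reference implementation for engineering design, addressing the lack of evaluation frameworks for multi-agent systems that combine simulation, retrieval, and manufacturing preparation.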
Findings
Proprietary models achieve 96-97% task completion on Beams2D.
Open-source 4B models reach 55-78% task completion.
Retrieval-augmented scores are near perfect with gating, validating the evaluation design.
Abstract
Large Language Model (LLM) agents are increasingly applied to engineering design tasks, yet existing evaluation frameworks do not adequately address multi-agent systems that combine simulation, retrieval, and manufacturing preparation. We introduce a benchmark suite with three evaluation dimensions: (1) a workflow benchmark with seven prompt styles targeting distinct cognitive demands, including direct tool use, semantic disambiguation, conditional branching, and working-memory tasks; (2) a Retrieval-Augmented Generation (RAG) benchmark with gated scoring isolating retrieval contributions to parameter selection; and (3) a High-Performance Computing (HPC) benchmark evaluating end-to-end ML training orchestration on a SLURM cluster. Alongside the benchmark, we present EngiAI, a Multi-Agent System (MAS) reference implementation built on LangGraph that operationalizes the benchmark by…
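The abstract does not define "gated scoring", so the following is only one plausible reading, sketched under stated assumptions: a trial earns credit for parameter selection only if a retrieval gate passes, so a correct parameter guessed without retrieval scores zero. The `Trial` fields, the exact-match accuracy rule, and the parameter names `volfrac` and `rmin` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One benchmark trial. All field names are hypothetical."""
    retrieved_relevant_doc: bool  # gate: did the agent retrieve the source document?
    params_chosen: dict           # parameters the agent actually used
    params_expected: dict         # parameters stated in the document to be retrieved


def parameter_accuracy(trial: Trial) -> float:
    """Fraction of expected parameters the agent reproduced exactly (assumed metric)."""
    if not trial.params_expected:
        return 0.0
    hits = sum(
        1
        for key, value in trial.params_expected.items()
        if trial.params_chosen.get(key) == value
    )
    return hits / len(trial.params_expected)


def gated_score(trial: Trial) -> float:
    """Credit parameter accuracy only when the retrieval gate passes."""
    if not trial.retrieved_relevant_doc:
        return 0.0  # no retrieval, no credit, even if the parameters happen to match
    return parameter_accuracy(trial)


# Hypothetical example: one of two expected parameters matches.
expected = {"volfrac": 0.35, "rmin": 1.5}
chosen = {"volfrac": 0.35, "rmin": 2.0}
with_retrieval = gated_score(Trial(True, chosen, expected))
without_retrieval = gated_score(Trial(False, chosen, expected))
```

Under this reading, comparing `with_retrieval` against `without_retrieval` separates what retrieval contributed from what the model could have produced from prior knowledge alone.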
