Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion

Terry Leitch

arXiv:2604.18566 · cs.AI · April 22, 2026

TL;DR

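This paper systematically evaluates cloud and local LLMs on two System Dynamics benchmarks: causal loop diagram (CLD) extraction and interactive model discussion. Cloud models lead on CLD extraction, the best local model matches mid-tier cloud performance, and backend choice affects JSON handling more than quantization level does.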

Contribution

The paper analyzes how model type and inference backend affect results, and reports benchmark results for cloud and local LLMs on the CLD Leaderboard (53 tests) and the Discussion Leaderboard.

Findings

01

Cloud models achieve 77-89% overall pass rates on CLD extraction; the best local model (Kimi K2.5 GGUF Q3, zero-shot engine) matches mid-tier cloud performance at 77%.

02

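The best local models reach 50-100% on model-building steps and 47-75% on feedback explanation, but only 0-50% on error fixing, a category dominated by long-context prompts that expose memory limits in local deployments.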

03

Backend choice affects JSON handling more than quantization level does.
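
To make the pass-rate figures and the JSON-handling point concrete, here is a minimal sketch of how a single structured CLD extraction test might be scored. It is an illustration, not the paper's harness: the JSON schema (`links`, `from`, `to`, `polarity`), the exact-match criterion, and the function names `parse_links`, `extraction_passes`, and `pass_rate` are all assumptions made for this example.

```python
import json


def parse_links(raw):
    """Parse a model reply into a set of (source, target, polarity) tuples.

    Returns None when the reply is not valid JSON in the assumed
    (hypothetical) schema, for example when the model wraps the JSON in prose.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not isinstance(data.get("links"), list):
        return None
    links = set()
    for link in data["links"]:
        try:
            links.add((
                link["from"].strip().lower(),
                link["to"].strip().lower(),
                link["polarity"],
            ))
        except (KeyError, AttributeError, TypeError):
            return None  # a malformed link fails the whole reply
    return links


def extraction_passes(raw, expected):
    """Assumed criterion: a test passes only on an exact match of the link set."""
    links = parse_links(raw)
    return links is not None and links == expected


def pass_rate(results):
    """Percentage of passing tests in a list of booleans."""
    return 100.0 * sum(results) / len(results) if results else 0.0


expected = {("births", "population", "+"), ("population", "births", "+")}
clean_reply = (
    '{"links": [{"from": "Births", "to": "Population", "polarity": "+"},'
    ' {"from": "population", "to": "births", "polarity": "+"}]}'
)
wrapped_reply = "Sure! Here is the JSON: " + clean_reply

results = [
    extraction_passes(clean_reply, expected),
    extraction_passes(wrapped_reply, expected),
]
print(results, pass_rate(results))
```

Under these assumptions, a reply with the correct links still fails if it is wrapped in prose that breaks strict JSON parsing, which is one way a backend's output handling could affect scores independently of the model's weights.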

Abstract

We present a systematic evaluation of large language model families, spanning both proprietary cloud APIs and locally-hosted open-source models, on two purpose-built benchmarks for System Dynamics AI assistance: the CLD Leaderboard (53 tests, structured causal loop diagram extraction) and the Discussion Leaderboard (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77-89% overall pass rates; the best local model reaches 77% (Kimi K2.5 GGUF Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50-100% on model building steps and 47-75% on feedback explanation, but only 0-50% on error fixing, a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a…
