Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion
Terry Leitch

TL;DR
This paper systematically evaluates cloud and local LLMs on system dynamics tasks, revealing performance differences, limitations, and the impact of model type and backend choices.
Contribution
It provides a comprehensive analysis of model type effects, backend impacts, and benchmarking results for LLMs on system dynamics tasks.
Findings
Cloud models achieve 77-89% overall pass rates on CLD extraction; the best local model matches mid-tier cloud performance at 77%.
The best local models reach 50-100% on model building steps, but only 0-50% on error fixing.
Backend choice affects JSON handling more than quantization level does.
Abstract
We present a systematic evaluation of large language model families, spanning both proprietary cloud APIs and locally hosted open-source models, on two purpose-built benchmarks for System Dynamics AI assistance: the CLD Leaderboard (53 tests, structured causal loop diagram extraction) and the Discussion Leaderboard (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77-89% overall pass rates; the best local model reaches 77% (Kimi K2.5 GGUF Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50-100% on model building steps and 47-75% on feedback explanation, but only 0-50% on error fixing, a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a…
