Large Language Models for IT Automation Tasks: Are We There Yet?
Md Mahadi Hassan, John Salvador, Akond Rahman, and Santu Karmaker

TL;DR
This paper assesses the ability of 14 open-source large language models to generate functional Ansible automation scripts for IT tasks, revealing significant limitations in state reasoning and domain-specific knowledge.
Contribution
Introduces ITAB, a new benchmark with 126 diverse IT automation tasks focusing on state reconciliation, and analyzes LLM failures in practical IT automation scenarios.
Findings
None of the 14 evaluated LLMs exceeds a pass@10 rate of 12% on ITAB.
The majority of errors stem from state reconciliation failures and module knowledge deficiencies.
Reliable IT automation requires advances in state reasoning and domain-specific understanding.
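For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled generations for a task passes its checks. The sketch below uses the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c pass. That ITAB computes pass@10 in exactly this way is an assumption, and the function names are hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each item is (n_samples, n_passed)."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no task results given")
    return sum(scores) / len(scores)


# One passing script out of 20 samples gives pass@10 = 0.5.
example = pass_at_k(n=20, c=1, k=10)
```

Under this reading, a benchmark-level pass@10 of at most 12% means that for most tasks none of the ten sampled scripts executed correctly.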
Abstract
LLMs show promise in code generation, yet their effectiveness for IT automation tasks, particularly for tools like Ansible, remains understudied. Existing benchmarks rely primarily on synthetic tasks that fail to capture the needs of practitioners who use IT automation tools, such as Ansible. We present ITAB (IT Automation Task Benchmark), a benchmark of 126 diverse tasks (e.g., configuring servers, managing files) where each task accounts for state reconciliation: a property unique to IT automation tools. ITAB evaluates LLMs' ability to generate functional Ansible automation scripts via dynamic execution in controlled environments. We evaluate 14 open-source LLMs, none of which accomplish pass@10 at a rate beyond 12%. To explain these low scores, we analyze 1,411 execution failures across the evaluated LLMs and identify two main categories of prevalent semantic errors: failures in…
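State reconciliation, as the abstract describes it, means a script declares a desired end state and the tool changes the system only where the current state differs. The sketch below illustrates that idea in plain Python for a single "this line must be present in this file" requirement. It is not Ansible's implementation or an ITAB task; `ensure_line` and the file name are hypothetical.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in `path`; return True only if a change was made."""
    # Read the current state (an absent file counts as empty).
    current = path.read_text().splitlines() if path.exists() else []
    if line in current:
        # Desired state already holds, so do nothing.
        return False
    # Reconcile: apply the smallest change that reaches the desired state.
    current.append(line)
    path.write_text("\n".join(current) + "\n")
    return True


# Running the same step twice changes the system only the first time.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "app.conf"  # hypothetical config file
    first_run_changed = ensure_line(cfg, "port=8080")
    second_run_changed = ensure_line(cfg, "port=8080")
```

The returned flag plays the role of a "changed" report: a script that reasons correctly about state reports a change on the first run and none on the second.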
Taxonomy
Topics: Business Process Modeling and Analysis
