Evaluating LLM Agents on Automated Software Analysis Tasks
Islem Bouzenia, Cristian Cadar, Michael Pradel

TL;DR
This paper introduces AnalysisBench, a benchmark for evaluating LLM agents on automated software analysis tasks, and demonstrates that specialized agent architectures significantly outperform baselines.
Contribution
It presents a new benchmark, AnalysisBench, and evaluates multiple LLM agent architectures, highlighting key limitations and factors affecting performance in automated software analysis.
Findings
AnalysisAgent achieves a 94% success rate on benchmark tasks.
Agent architecture impacts performance more than LLM capability.
Whole-program analysis and symbolic execution are particularly challenging.
Abstract
Numerous software analysis tools exist today, yet applying them to diverse open-source projects remains challenging due to environment setup, dependency resolution, and tool configuration. LLM-based agents offer a potential solution, yet no prior work has systematically studied their effectiveness on the specific task of automated software analysis, which, unlike issue solving or general environment setup, requires installing and configuring a separate analysis tool alongside the target project, generating tool-specific prerequisites, and validating that the tool produces meaningful analysis outputs. We introduce AnalysisBench, a benchmark of 35 tool-project pairs spanning seven analysis tools and ten diverse C/C++ and Java projects, each with a manually constructed reference setup. Using AnalysisBench, we evaluate four agent architectures across four LLM backends. Our custom agent,…
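The task structure described in the abstract (install and configure the tool alongside the project, generate tool-specific prerequisites, validate that the output is meaningful) can be pictured as a per-pair record plus a success-rate computation. The following is only a minimal sketch under stated assumptions: the class names, field names, the all-stages-must-pass success criterion, and the placeholder tool and project identifiers are all hypothetical and are not taken from AnalysisBench itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolProjectPair:
    """One benchmark entry: an analysis tool applied to a target project.

    All identifiers used below are placeholders, not real AnalysisBench entries.
    """
    tool: str
    project: str
    language: str  # e.g. "C/C++" or "Java"


@dataclass
class RunOutcome:
    """Hypothetical record of one agent run on one tool-project pair."""
    pair: ToolProjectPair
    tool_installed: bool            # tool set up alongside the project
    prerequisites_generated: bool   # tool-specific inputs were produced
    output_validated: bool          # tool produced meaningful analysis output

    @property
    def success(self) -> bool:
        # Assumption: a run counts as successful only if every stage passed.
        return (
            self.tool_installed
            and self.prerequisites_generated
            and self.output_validated
        )


def success_rate(outcomes: list[RunOutcome]) -> float:
    """Fraction of runs in which all stages succeeded (0.0 for an empty list)."""
    if not outcomes:
        return 0.0
    return sum(o.success for o in outcomes) / len(outcomes)


# Made-up outcomes, purely to exercise the functions above.
outcomes = [
    RunOutcome(ToolProjectPair("tool-a", "project-x", "C/C++"), True, True, True),
    RunOutcome(ToolProjectPair("tool-a", "project-y", "Java"), True, True, False),
    RunOutcome(ToolProjectPair("tool-b", "project-x", "C/C++"), True, False, False),
    RunOutcome(ToolProjectPair("tool-b", "project-y", "Java"), True, True, True),
]
print(f"{success_rate(outcomes):.0%}")
```

Separating the three stages in the record makes it possible to see where a run failed, not just whether it failed; the paper's actual success criterion may differ from the conjunction assumed here.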
