Less Is More: Measuring How LLM Involvement Affects Chatbot Accuracy in Static Analysis
Krishna Narasimhan

TL;DR
This study compares three LLM-based architectures for translating natural language into code analysis queries, finding that structured intermediate representations significantly improve accuracy, especially for large models.
Contribution
It introduces and evaluates a spectrum of LLM involvement architectures, highlighting the effectiveness of structured intermediate representations over direct or agentic approaches.
Findings
Structured intermediate representation outperforms direct generation by 15-25 percentage points.
Large models benefit most from constrained, well-typed intermediates.
Schema compliance limits small models' performance, despite the structured approach.
Abstract
Large language models are increasingly used to make static analysis tools accessible through natural language, yet existing systems differ in how much they delegate to the LLM without treating the degree of delegation as an independent variable. We compare three architectures along a spectrum of LLM involvement for translating natural language to Joern's query language CPGQL: direct query generation (Approach 1), generation of a schema-constrained JSON intermediate representation (Approach 2), and tool-augmented agentic generation (Approach 3). These are evaluated on a benchmark of 20 code analysis tasks across three complexity tiers, using four open-weight models in a 2×2 design (two model families × two scales), each with three repetitions. The structured intermediate representation (Approach 2) achieves the highest result match rates, outperforming direct…
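To make the schema-constrained intermediate representation concrete, here is a minimal Python sketch of the general idea: the model emits a small JSON object, and deterministic code validates it and renders the CPGQL query. The JSON fields (`node_type`, `filters`, `field`, `value`), the allow-lists, and the function `render_cpgql` are hypothetical names chosen for illustration; the paper's actual schema is not shown in this summary and may differ.

```python
import json

# Hypothetical allow-lists standing in for a real schema (assumption, not the paper's schema).
ALLOWED_NODE_TYPES = {"method", "call"}
ALLOWED_FILTERS = {"name"}


def render_cpgql(ir_json: str) -> str:
    """Validate a hypothetical JSON intermediate representation and render a CPGQL string."""
    ir = json.loads(ir_json)

    node_type = ir.get("node_type")
    if node_type not in ALLOWED_NODE_TYPES:
        raise ValueError(f"unsupported node_type: {node_type!r}")

    steps = [f"cpg.{node_type}"]
    for flt in ir.get("filters", []):
        field = flt.get("field")
        value = flt.get("value")
        if field not in ALLOWED_FILTERS:
            raise ValueError(f"unsupported filter field: {field!r}")
        if not isinstance(value, str):
            raise ValueError("filter value must be a string")
        # json.dumps produces a double-quoted, escaped string literal.
        steps.append(f"{field}({json.dumps(value)})")

    steps.append("l")  # materialise the traversal as a list
    return ".".join(steps)


if __name__ == "__main__":
    ir = '{"node_type": "call", "filters": [{"field": "name", "value": "strcpy"}]}'
    print(render_cpgql(ir))  # cpg.call.name("strcpy").l
```

In this arrangement the model never writes query syntax directly: anything outside the allow-lists is rejected before a query is built, which is one plausible reading of why a constrained, well-typed intermediate could help.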
