Benchmarking Causal Study to Interpret Large Language Models for Source Code
Daniel Rodriguez-Cardenas, David N. Palacio, Dipin Khati, Henry Burke, Denys Poshyvanyk

TL;DR
This paper introduces Galeras, a benchmarking strategy using causal inference to evaluate large language models for source code tasks, addressing confounding factors and improving interpretability of performance metrics.
Contribution
The paper presents a novel benchmarking approach incorporating causal inference to better interpret LLM performance on software engineering tasks, accounting for confounders.
Findings
Prompt semantics positively influence ChatGPT's performance by about 3%.
Prompt size is highly correlated with accuracy metrics (~0.412%).
Causal inference reduces confounding bias, enhancing interpretability.
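To make the last finding concrete, here is a minimal sketch of how adjusting for a confounder can change an estimated effect. It is not the Galeras pipeline: it uses plain stratification (backdoor adjustment) on a hypothetical toy dataset in which prompt size affects both the prompt type chosen and the accuracy score. The record layout, the bucket names, the scores, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (treated, prompt_size_bucket, accuracy_score).
# treated = 1 means the "semantic" prompt variant was used (an assumption).
# Long prompts are more often treated AND score higher, so prompt size
# confounds the raw comparison.
RECORDS = [
    (1, "short", 0.50),
    (0, "short", 0.40), (0, "short", 0.40), (0, "short", 0.40),
    (1, "long", 0.80), (1, "long", 0.80), (1, "long", 0.80),
    (0, "long", 0.70),
]


def naive_effect(records):
    """Difference in mean score between treated and control, ignoring confounders."""
    treated = [y for t, _, y in records if t == 1]
    control = [y for t, _, y in records if t == 0]
    return mean(treated) - mean(control)


def adjusted_effect(records):
    """Stratified effect estimate: per-stratum differences weighted by stratum share."""
    strata = defaultdict(list)
    for t, z, y in records:
        strata[z].append((t, y))

    total = len(records)
    ate = 0.0
    for rows in strata.values():
        treated = [y for t, y in rows if t == 1]
        control = [y for t, y in rows if t == 0]
        if not treated or not control:
            # Without both groups in a stratum, the comparison is undefined.
            raise ValueError("stratum lacks treated or control records")
        ate += (len(rows) / total) * (mean(treated) - mean(control))
    return ate


if __name__ == "__main__":
    print(f"naive difference:    {naive_effect(RECORDS):.3f}")
    print(f"adjusted difference: {adjusted_effect(RECORDS):.3f}")
```

In this toy data the within-bucket difference is 0.10 for both short and long prompts, while the unadjusted comparison reports roughly 0.25 because treated prompts are mostly long ones. The gap between the two numbers is the confounding bias that a causal analysis aims to remove.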
Abstract
One of the most common solutions adopted by software researchers to address code generation is to train Large Language Models (LLMs) on massive amounts of source code. Although a number of studies have shown that LLMs have been effectively evaluated on popular accuracy metrics (e.g., BLEU, CodeBLEU), previous research has largely overlooked the role of Causal Inference as a fundamental component of the interpretability of LLMs' performance. Existing benchmarks and datasets are meant to highlight the difference between the expected and the generated outcome, but do not take into account confounding variables (e.g., lines of code, prompt size) that equally influence the accuracy metrics. The fact remains that, when dealing with generative software tasks by LLMs, no benchmark is available to tell researchers how to quantify neither the causal effect of SE-based treatments nor the…
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Scientific Computing and Data Management
