The Program Testing Ability of Large Language Models for Code
Weimin Xiong, Yiwen Guo, Hao Chen

TL;DR
This paper investigates the ability of large language models to test code, revealing their properties and demonstrating how their testing capabilities can enhance program synthesis, leading to improved code quality and higher pass rates.
Contribution
It provides a comprehensive analysis of LLMs' program testing abilities and proposes methods to leverage these capabilities for better code synthesis results.
Findings
LLMs exhibit intriguing properties in program testing.
Testing capabilities can be improved to enhance code synthesis.
Achieved +11.77% and +4.22% higher pass rates on HumanEval+.
Abstract
Recent developments in large language models (LLMs) for code, such as Codex and CodeT5+, demonstrate tremendous promise in achieving code intelligence. Their ability to synthesize code that completes a program for a pre-defined task has been intensively tested and verified on benchmark datasets including HumanEval and MBPP. Yet evaluation of these LLMs from perspectives beyond program synthesis is also anticipated, considering their broad scope of applications in software engineering. In this paper, we explore the ability of LLMs to test programs/code. By performing thorough analyses of recent LLMs for code in program testing, we show a series of intriguing properties of these models and demonstrate how the program testing ability of LLMs can be improved. Following recent work which utilizes generated test cases to enhance program synthesis, we further leverage our…
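The abstract refers to recent work that uses generated test cases to enhance program synthesis. A minimal sketch of that general idea is shown here, assuming a simplified setup that is not taken from the paper: candidate programs are plain Python callables, each generated test is a hypothetical `(arguments, expected_output)` pair, and the candidate that passes the most tests is selected. The names `count_passed` and `select_best` and the toy data are illustrative only.

```python
from typing import Callable, List, Tuple

# Hypothetical test format: (positional arguments, expected return value).
TestCase = Tuple[tuple, object]


def count_passed(program: Callable, tests: List[TestCase]) -> int:
    """Count how many test cases the program passes; a crash counts as a failure."""
    passed = 0
    for args, expected in tests:
        try:
            if program(*args) == expected:
                passed += 1
        except Exception:
            # A candidate that raises on a test simply does not pass it.
            pass
    return passed


def select_best(candidates: List[Callable], tests: List[TestCase]) -> Callable:
    """Return the candidate passing the most tests (first one wins on ties)."""
    return max(candidates, key=lambda program: count_passed(program, tests))


# Toy stand-ins for model-generated programs (assumed, not real model output).
def add_correct(a, b):
    return a + b


def add_buggy(a, b):
    return a - b


# Toy stand-ins for model-generated test cases.
generated_tests = [((1, 2), 3), ((0, 0), 0), ((5, -5), 0)]

best = select_best([add_buggy, add_correct], generated_tests)
print(best.__name__)
```

In this sketch the generated tests are trusted as given. In practice some generated tests are wrong, which is why the correctness of the tests themselves matters for this kind of selection.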
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. First analysis of the test case generation ability of large language models for competitive coding (algorithmic coding) problems. 2. I like the takeaway that a model generates better test cases when you first ask it to generate code to solve the problem (analogous to Chain of Thought). This is a good empirical result backed by intuition.
I like the general premise of this work. However, it is my opinion that it is not ready for publication in its current form due to several drawbacks: 1. There are two distinct problems that this paper is trying to solve. The first is test generation, and the second is test-guided program generation (analogous to CodeT). However, in attempting to do both of these, the paper ends up not doing justice to either one. 2. If the aim is to study test generation with language models, then: - The pa
1. The ability of LLMs to generate valid and useful test cases is an interesting question worthy of study. 2. The experimental design is largely sound (see some concerns below in **Questions**) and the experiments are comprehensive (in terms of the number of models and datasets considered). 3. The experiments lead to interesting yet expected conclusions. For instance, larger models and models with better code synthesis capabilities also tend to be better at generating test cases. It also turns
1. It is not clear to me why test case generation is restricted to three test cases. In practice, one would want to generate a much larger number of tests and therefore, it is important to evaluate the performance of the models in such settings. The authors already note that the first generated test case often is more likely to be correct than the third test. If the number of generated tests is increased, this problem is likely to become worse and this would suggest that LLMs should NOT be used
- Impressive evaluation using 11 LLMs and different testing settings. The benchmarks could be improved, but are on par with what is being used by others in the field. - A solid set of observations, as outlined in my summary. I would argue that some of these are well-known (e.g., 1, 3).
- The experimental evaluation and the observations are nice and make sense to share with the community, but the actual contribution of the paper beyond that is relatively minor. - Using self-generated code (instead of a placeholder) is not exactly a fair comparison with previous work, as it samples the model 2x for each generation. Using the placeholder code was somewhat intended in the original technique so that you only generate the code itself once.
This paper addresses an important problem, namely the testing ability of LLMs. The writing is good overall, though the meaning of some sentences is unclear to me. The results, while preliminary, might still be useful. 10+ LLMs have been evaluated.
The significance and novelty of this work are unclear. The pass rate and test generation of LLMs have already been studied in existing work (Chen et al., 2021). Important relevant work was also missed in the paper. It was said, on Page 1, that “In this paper, we, for the first time, analyze the ability of recent LLMs in testing programs/code.” This is at least over-claiming. Existing work like CodaMosa already applies LLMs for improving code coverage and it was not mentioned or compared in this
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software Reliability and Analysis Research
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Layer Normalization · Dropout · Weight Decay · Softmax · Byte Pair Encoding
