ScenEval: A Benchmark for Scenario-Based Evaluation of Code Generation
Debalina Ghosh Paul, Hong Zhu, Ian Bayley

TL;DR
This paper introduces ScenEval, a benchmark for scenario-based evaluation of code generation models. It demonstrates the benchmark by evaluating ChatGPT on Java tasks and analyzes how performance varies with task complexity.
Contribution
It proposes a methodology for constructing scenario-based benchmarks with metadata, exemplified by ScenEval for code generation evaluation.
Findings
ChatGPT's performance drops with increasing task complexity.
Generated code tends to be shorter than the reference solution; when it is correct, it is often more complex than the reference.
Incorrect generated code tends to be less complex than reference solutions.
Abstract
In the scenario-based evaluation of machine learning models, a key problem is how to construct test datasets that represent various scenarios. The methodology proposed in this paper is to construct a benchmark and attach metadata to each test case. Then a test system can be constructed with test morphisms that filter the test cases based on metadata to form a dataset. The paper demonstrates this methodology with large language models for code generation. A benchmark called ScenEval is constructed from problems in textbooks, an online tutorial website and Stack Overflow. Filtering by scenario is demonstrated and the test sets are used to evaluate ChatGPT for Java code generation. Our experiments found that the performance of ChatGPT decreases with the complexity of the coding task. It is weakest for advanced topics like multi-threading, data structure algorithms and recursive…
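The filtering idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the names `BenchmarkCase` and `filter_by_metadata`, the metadata keys `topic` and `source`, and the three sample tasks are all hypothetical, chosen only to show how a metadata predicate can select a scenario-specific test set from a larger benchmark.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class BenchmarkCase:
    """One benchmark task plus the metadata used for scenario filtering (hypothetical schema)."""
    task_id: str
    description: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(
    cases: Iterable[BenchmarkCase],
    predicate: Callable[[dict], bool],
) -> List[BenchmarkCase]:
    """Return the cases whose metadata satisfies the predicate, preserving order."""
    return [case for case in cases if predicate(case.metadata)]


# Invented sample data; the real benchmark's tasks and metadata fields may differ.
benchmark = [
    BenchmarkCase("t1", "Reverse a string",
                  {"topic": "strings", "source": "textbook"}),
    BenchmarkCase("t2", "Run two threads that share a counter",
                  {"topic": "multi-threading", "source": "stackoverflow"}),
    BenchmarkCase("t3", "Compute a factorial recursively",
                  {"topic": "recursion", "source": "textbook"}),
]

# Scenario: only tasks on advanced topics.
advanced = filter_by_metadata(
    benchmark,
    lambda m: m.get("topic") in {"multi-threading", "recursion"},
)
print([case.task_id for case in advanced])  # ['t2', 't3']
```

Expressing each scenario as a predicate over metadata keeps the benchmark itself unchanged, so several scenario-specific test sets can be derived from the same pool of tasks.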
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Software Testing and Debugging Techniques · Software Engineering Research
