Do LLMs generate test oracles that capture the actual or the expected program behaviour?

Michael Konstantinou; Renzo Degiovanni; Mike Papadakis

arXiv:2410.21136 · cs.SE · October 29, 2024 · 2 cites

Open Access

TL;DR

This paper investigates whether Large Language Models generate test oracles that accurately reflect the expected program behavior, revealing they tend to capture actual behavior and are more effective at generating than classifying oracles.

Contribution

It provides an empirical analysis of LLM-generated test oracles, highlighting their limitations and strengths in capturing expected behavior compared to traditional methods.

Findings

01 LLMs tend to generate oracles capturing actual behavior rather than expected.

02 LLMs are better at generating than classifying test oracles.

03 LLM-generated oracles have higher fault detection potential.

Abstract

Software testing is an essential part of the software development cycle and serves to improve code quality. Typically, a unit test consists of a test prefix and a test oracle, which captures the developer's intended behaviour. A known limitation of traditional test generation techniques (e.g. Randoop and EvoSuite) is that they produce test oracles that capture the actual program behaviour rather than the expected one. Recent approaches leverage Large Language Models (LLMs), trained on an enormous amount of data, to generate developer-like code and test cases. We investigate whether LLM-generated test oracles capture the actual or the expected software behaviour. We therefore conduct a controlled experiment to answer this question by studying LLMs' performance on two tasks, namely test oracle classification and generation. The study includes developer-written and automatically generated test cases…
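The distinction between an oracle that captures actual behaviour and one that captures expected behaviour can be illustrated with a minimal sketch. The example is not taken from the paper: the function `absolute_value`, its seeded bug, and the helper `run_oracle` are hypothetical and exist only for illustration.

```python
def absolute_value(x: int) -> int:
    """Intended behaviour: return |x|. Hypothetical buggy implementation."""
    # Seeded bug (assumption for this sketch): negative inputs come back unchanged.
    return x if x >= 0 else x


def run_oracle(asserted_value: int) -> bool:
    """Run one unit test and report whether its oracle passes."""
    # Test prefix: prepare the input and call the code under test.
    result = absolute_value(-3)
    # Test oracle: compare the observed result with the asserted value.
    return result == asserted_value


# Oracle derived from what the code currently does: it passes and hides the bug.
actual_oracle_passes = run_oracle(-3)

# Oracle derived from what the developer intended: it fails and exposes the bug.
expected_oracle_passes = run_oracle(3)

print("actual-behaviour oracle passes:", actual_oracle_passes)
print("expected-behaviour oracle passes:", expected_oracle_passes)
```

Both tests share the same prefix and differ only in the asserted value. An oracle derived from observed output passes on the faulty code, while an oracle derived from the intended behaviour fails and thereby reveals the fault.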

Taxonomy

Topics: Software Testing and Debugging Techniques · Teaching and Learning Programming · Distributed and Parallel Computing Systems