Understanding LLM-Driven Test Oracle Generation

Adam Bodicoat; Gunel Jahangirova; Valerio Terragni

arXiv:2601.05542 · cs.SE · January 12, 2026

Open Access

TL;DR

This paper explores how Large Language Models can be used to generate test oracles that better reflect intended software behavior, addressing a key challenge in automated testing.

Contribution

It provides an empirical analysis of LLM-based test oracle generation, examining prompting strategies and contextual inputs to improve oracle quality.

Findings

01. LLMs can generate effective test oracles that detect software failures.

02. Prompting strategies significantly influence oracle quality.

03. Limitations of LLM-generated oracles highlight areas for future research.

Abstract

Automated unit test generation aims to improve software quality while reducing the time and effort required for creating tests manually. However, existing techniques primarily generate regression oracles that predicate on the implemented behavior of the class under test. They do not address the oracle problem: the challenge of distinguishing correct from incorrect program behavior. With the rise of Foundation Models (FMs), particularly Large Language Models (LLMs), there is a new opportunity to generate test oracles that reflect intended behavior. This positions LLMs as enablers of Promptware, where software creation and testing are driven by natural-language prompts. This paper presents an empirical study on the effectiveness of LLMs in generating test oracles that expose software failures. We investigate how different prompting strategies and levels of contextual input impact the…
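
The distinction the abstract draws between a regression oracle and an oracle that reflects intended behavior can be illustrated with a minimal Python sketch. It is not taken from the paper: the function `buggy_abs`, the helper `make_regression_oracle`, and the hand-written `intended_oracle` are hypothetical names chosen for illustration, and the intended-behavior check stands in for what an LLM might produce from a natural-language description.

```python
def buggy_abs(x):
    """Intended behavior: return the absolute value of x."""
    # Deliberate bug for illustration: negative inputs are returned unchanged.
    return x


def make_regression_oracle(implementation, test_input):
    """Build an oracle from the implementation's observed output.

    This mirrors how a regression oracle is derived: whatever the current
    implementation returns is recorded as the expected value.
    """
    observed = implementation(test_input)

    def oracle(candidate):
        return candidate(test_input) == observed

    return oracle


def intended_oracle(candidate):
    """Oracle derived from the specification ("absolute value"), not the code."""
    return candidate(-3) == 3


# The regression oracle records the buggy output (-3) as expected, so it passes.
regression_oracle = make_regression_oracle(buggy_abs, -3)
regression_passes = regression_oracle(buggy_abs)

# The intended-behavior oracle expects 3, so it fails and exposes the bug.
intended_passes = intended_oracle(buggy_abs)

print("regression oracle passes:", regression_passes)
print("intended oracle passes:", intended_passes)
```

In this sketch the regression oracle can only flag future changes in behavior, while the specification-based oracle flags the existing fault. That gap is the oracle problem the abstract refers to.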

Taxonomy

Topics: Software Testing and Debugging Techniques · Software System Performance and Reliability · Software Engineering Research