On the Effectiveness of LLMs for Manual Test Verifications

Myron David Lucena Campos Peixoto, Davy de Medeiros Baia, Nathalia Nascimento, Paulo Alencar, Baldoino Fonseca, and Márcio Ribeiro

arXiv:2409.12405 · cs.SE · September 20, 2024

TL;DR

This paper investigates the use of Large Language Models to generate manual test verifications, comparing their effectiveness and professional acceptance, and providing a dataset of generated verifications for future research.

Contribution

It presents a comprehensive evaluation of multiple LLMs for test verification generation and releases a large dataset of generated verifications for the software testing community.

Findings

01

Open-source LLMs perform comparably to closed-source models.

02

Professional testers' agreement with the generated verifications was slightly above 40%.

03

Some generated verifications outperform original ones, but hallucinations pose challenges.

Abstract

Background: Manual testing is vital for detecting issues missed by automated tests, but specifying accurate verifications is challenging. Aims: This study aims to explore the use of Large Language Models (LLMs) to produce verifications for manual tests. Method: We conducted two independent and complementary exploratory studies. The first study involved using 2 closed-source and 6 open-source LLMs to generate verifications for manual test steps and evaluate their similarity to original verifications. The second study involved recruiting software testing professionals to assess their perception and agreement with the generated verifications compared to the original ones. Results: The open-source models Mistral-7B and Phi-3-mini-4k demonstrated effectiveness and consistency comparable to closed-source models like Gemini-1.5-flash and GPT-3.5-turbo in generating manual test verifications.…

Taxonomy

Topics: Real-time simulation and control systems · Software Testing and Debugging Techniques

Methods: Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Softmax · Layer Normalization · Dropout