On the Effectiveness of LLMs for Manual Test Verifications
Myron David Lucena Campos Peixoto, Davy de Medeiros Baia, Nathalia Nascimento, Paulo Alencar, Baldoino Fonseca, and Márcio Ribeiro

TL;DR
This paper investigates the use of Large Language Models to generate manual test verifications, comparing their effectiveness and professional acceptance, and providing a dataset of generated verifications for future research.
Contribution
It presents a comprehensive evaluation of multiple LLMs for test verification generation and releases a large dataset of generated verifications for the software testing community.
Findings
Open-source LLMs perform comparably to closed-source models.
Agreement among professional testers on the generated verifications was only slightly above 40%.
Some generated verifications outperform original ones, but hallucinations pose challenges.
Abstract
Background: Manual testing is vital for detecting issues missed by automated tests, but specifying accurate verifications is challenging. Aims: This study aims to explore the use of Large Language Models (LLMs) to produce verifications for manual tests. Method: We conducted two independent and complementary exploratory studies. The first study involved using 2 closed-source and 6 open-source LLMs to generate verifications for manual test steps and evaluate their similarity to original verifications. The second study involved recruiting software testing professionals to assess their perception and agreement with the generated verifications compared to the original ones. Results: The open-source models Mistral-7B and Phi-3-mini-4k demonstrated effectiveness and consistency comparable to closed-source models like Gemini-1.5-flash and GPT-3.5-turbo in generating manual test verifications.…
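The first study compares each generated verification with the original one. The excerpt above does not say which similarity measure was used, so the following is only a minimal sketch that assumes a token-level Jaccard overlap as a stand-in. The function names (`tokenize`, `jaccard_similarity`, `rank_models`) and the sample texts are hypothetical and are not taken from the paper or its dataset.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(original: str, generated: str) -> float:
    """Token-level Jaccard overlap in [0, 1]; an assumed stand-in metric."""
    a, b = tokenize(original), tokenize(generated)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def rank_models(original: str, outputs: dict[str, str]) -> list[tuple[str, float]]:
    """Score each model's verification against the original, best first."""
    scores = {name: jaccard_similarity(original, text) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical test step verification and model outputs, for illustration only.
    original = "The settings dialog opens and shows the saved user name."
    outputs = {
        "model-a": "The settings dialog opens and displays the saved user name.",
        "model-b": "An error message appears.",
    }
    for name, score in rank_models(original, outputs):
        print(f"{name}: {score:.2f}")
```

A lexical overlap like this rewards shared wording rather than shared meaning, so a paraphrased but correct verification can score low. An embedding-based measure would be a natural alternative under the same interface.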
Taxonomy
Topics: Real-time simulation and control systems · Software Testing and Debugging Techniques
Methods: Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Softmax · Layer Normalization · Dropout
