Can Large Language Models Help Students Prove Software Correctness? An Experimental Study with Dafny

Carolina Carreira; \'Alvaro Silva; Alexandre Abreu; and Alexandra Mendes

arXiv:2506.22370·cs.SE·September 9, 2025

Can Large Language Models Help Students Prove Software Correctness? An Experimental Study with Dafny

Carolina Carreira, \'Alvaro Silva, Alexandre Abreu, and Alexandra Mendes

PDF

Open Access

TL;DR

This study explores how large language models like ChatGPT can assist students in formal verification tasks using Dafny, revealing improved performance with quality prompts and offering practical integration strategies.

Contribution

It provides empirical insights into student interactions with LLMs in formal verification, highlighting effective strategies and trust factors, and offers practical recommendations for educational use.

Findings

01

Students perform better with ChatGPT assistance.

02

Prompt quality significantly impacts performance.

03

Trust in LLMs influences student reliance and effectiveness.

Abstract

Students in computing education increasingly use large language models (LLMs) such as ChatGPT. Yet, the role of LLMs in supporting cognitively demanding tasks, like deductive program verification, remains poorly understood. This paper investigates how students interact with an LLM when solving formal verification exercises in Dafny, a language that supports functional correctness, by allowing programmers to write formal specifications and automatically verifying that the implementation satisfies the specification. We conducted a mixed-methods study with master's students enrolled in a formal methods course. Each participant completed two verification problems, one with access to a custom ChatGPT interface that logged all interactions, and the other without. We identified strategies used by successful students and assessed the level of trust students place in LLMs. Our findings show that…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSoftware Engineering Research · Software Reliability and Analysis Research · Software System Performance and Reliability