Can Large Language Models Help Students Prove Software Correctness? An Experimental Study with Dafny
Carolina Carreira, \'Alvaro Silva, Alexandre Abreu, and Alexandra Mendes

TL;DR
This study explores how large language models like ChatGPT can assist students in formal verification tasks using Dafny, revealing improved performance with quality prompts and offering practical integration strategies.
Contribution
It provides empirical insights into student interactions with LLMs in formal verification, highlighting effective strategies and trust factors, and offers practical recommendations for educational use.
Findings
Students perform better with ChatGPT assistance.
Prompt quality significantly impacts performance.
Trust in LLMs influences student reliance and effectiveness.
Abstract
Students in computing education increasingly use large language models (LLMs) such as ChatGPT. Yet, the role of LLMs in supporting cognitively demanding tasks, like deductive program verification, remains poorly understood. This paper investigates how students interact with an LLM when solving formal verification exercises in Dafny, a language that supports functional correctness, by allowing programmers to write formal specifications and automatically verifying that the implementation satisfies the specification. We conducted a mixed-methods study with master's students enrolled in a formal methods course. Each participant completed two verification problems, one with access to a custom ChatGPT interface that logged all interactions, and the other without. We identified strategies used by successful students and assessed the level of trust students place in LLMs. Our findings show that…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSoftware Engineering Research · Software Reliability and Analysis Research · Software System Performance and Reliability
