Clover: Closed-Loop Verifiable Code Generation

Chuyue Sun; Ying Sheng; Oded Padon; Clark Barrett

arXiv:2310.17807·cs.AI·November 19, 2024·1 cites

Clover: Closed-Loop Verifiable Code Generation

Chuyue Sun, Ying Sheng, Oded Padon, Clark Barrett

PDF

Open Access 2 Repos 4 Reviews

TL;DR

Clover introduces a closed-loop verification approach for code generation that uses consistency checks among code, documentation, and formal annotations to improve correctness and reliability of generated code.

Contribution

The paper presents a novel framework combining formal verification with large language models for verifiable code generation, supported by theoretical analysis and empirical evaluation.

Findings

01

LLMs can effectively generate formal specifications.

02

The consistency checker achieves up to 87% acceptance rate for correct code.

03

Clover identified errors in existing human-written datasets.

Abstract

The use of large language models for code generation is a rapidly growing trend in software development. However, without effective methods for ensuring the correctness of generated code, this trend could lead to undesirable outcomes. In this paper, we introduce a new approach for addressing this challenge: the Clover paradigm, short for Closed-Loop Verifiable Code Generation, which uses consistency checking to provide a strong filter for incorrect code. Clover performs consistency checks among code, docstrings, and formal annotations. The checker is implemented using a novel integration of formal verification tools and large language models. We provide a theoretical analysis to support our thesis that Clover should be effective at consistency checking. We also empirically investigate its performance on a hand-designed dataset (CloverBench) featuring annotated Dafny programs at a…

Peer Reviews

Decision·ICLR 2024 Conference Withdrawn Submission

Reviewer 01Rating 3· reject, not good enoughConfidence 4

Strengths

+ An important problem + Easy follow paper

Weaknesses

- Unsound design - Small-scale experiment - Lack important details

Reviewer 02Rating 6· marginally above the acceptance thresholdConfidence 4

Strengths

- This paper explores a method to rigourously verify the artifacts generated by Code-LLMs. This problem should be of wide interest, but there seem to be little efforts in this direction. - Clover exhibits high acceptance rates for correctly generated artifacts while precisely rejecting the incorrect ones. - The paper contributes CloverBench, which could be a useful benchmark in the future.

Weaknesses

- It seems Clover will need to depend on very strong LLMs (e.g., to generate code/docstring from just the annotations). Thus, consistency checks may often fail if the LLM is weak. - Dependence on a verification tool. (Can it check any given java file or just short program snippets?) - A very small evaluation set of just 60 examples.

Reviewer 03Rating 5· marginally below the acceptance thresholdConfidence 4

Strengths

- The paper is very nicely written and proposes a well-principled approach to verify AI-generated codes. - Particularly, I found the insight interesting where the authors propose that AI-generated codes should inherently also include formal specifications (annotation viz. precondition, postcondition, loop-invariants) and natural language descriptions (docstring). This is a principled strategy that essentially asks an LLM to provide some form of certification justifying what it generated. - The

Weaknesses

- Although the paper proposes a good concept for program verification, it's still in a nascent stage and has a long way to go for real-world deployment. I understand that this is a first step towards a bigger objective when the author mentions that "we predict that Clover, coupled with steadily improving generative AI and formal tools, will enable a future in which fully automatic, scalable generation of formally verified code is feasible. This paper charts the first steps towards realizing this

Reviewer 04Rating 5· marginally below the acceptance thresholdConfidence 5

Strengths

- An interesting approach for making generated code carry supporting evidence for its correctness. - A benchmark suite that can enable further experimentation on (code,docstring,annotation) triplets or subsets thereof. - Clear and compelling presentation.

Weaknesses

- I like the approach, but I'm afraid that the evaluation really falls short. The main weakness is that CloverBench contains classical examples for which GPT4 is clearly able to generate docstring, invariants, and code quite easily from each component of the triplet. In fact, I am guessing that it has seen different versions of the loop-invariant for each of these examples multiple times in different styles and languages. - The weakest link in the Clover approach itself is the natural language d

Code & Models

Repositories

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSoftware Testing and Debugging Techniques · Software Engineering Research · Software Reliability and Analysis Research