Disproving Program Equivalence with LLMs

Miltiadis Allamanis; Pengcheng Yin

arXiv:2502.18473·cs.SE·February 27, 2025

Disproving Program Equivalence with LLMs

Miltiadis Allamanis, Pengcheng Yin

PDF

Open Access

TL;DR

This paper introduces ProbeGen, a whitebox method leveraging LLMs with execution feedback to identify counterexamples in code equivalence, revealing limitations of unit test benchmarks and improving semantic clustering.

Contribution

The work presents ProbeGen, a novel whitebox approach that effectively disproves code equivalence and enhances semantic clustering of LLM-generated code samples.

Findings

01

ProbeGen disapproves 18% of code samples deemed equivalent by unit tests.

02

Using ProbeGen improves pass@1 by 10% through semantic clustering.

03

LLMs with execution feedback excel at understanding code semantics.

Abstract

To evaluate large language models (LLMs) for code, research has used manually created unit test-based benchmarks. However, these tests are often inadequate, missing corner cases and other implementation-specific oddities. This work introduces ProbeGen, a whitebox method that takes two or more executable pieces of code and searches for counterexamples to their equivalence. Comparing code semantics requires a deep understanding of code. We demonstrate that LLMs with execution feedback perform well at this task. In a common code synthesis benchmark, ProbeGen disproves 18% of samples considered equivalent to the ground truth by the benchmark-provided unit tests. Additionally, using ProbeGen, we can semantically cluster LLM samples for semantic self-consistency, improving pass@1 by 10% by unifying syntactically distinct but semantically similar samples.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSoftware Reliability and Analysis Research · Software Testing and Debugging Techniques