CETBench: A Novel Dataset constructed via Transformations over Programs for Benchmarking LLMs for Code-Equivalence Checking

Neeva Oza; Ishaan Govil; Parul Gupta; Dinesh Khandelwal; Dinesh Garg; Parag Singla

arXiv:2506.04019·cs.SE·June 5, 2025

TL;DR

This paper introduces CETBench, a dataset for evaluating LLMs on code-equivalence checking. It shows that even simple code transformations can significantly degrade the performance of current models, and it proposes fine-tuning as a way to improve that performance.

Contribution

The paper presents CETBench, a novel dataset constructed via code transformations, and demonstrates the effectiveness of fine-tuning LLMs to improve code-equivalence detection.

Findings

01. Simple code transformations can significantly reduce LLM performance.

02. Fine-tuning improves LLM accuracy on transformed code pairs.

03. CETBench is versatile for varying program difficulties and transformations.

Abstract

LLMs have been extensively used for the task of automated code generation. In this work, we examine the applicability of LLMs for the related but relatively unexplored task of code-equivalence checking, i.e., given two programs, whether they are functionally equivalent or not. This is an important problem since benchmarking code equivalence can play a critical role in evaluating LLM capabilities for tasks such as code re-writing and code translation. Towards this end, we present CETBench - Code Equivalence with Transformations Benchmark, constructed via a repository of programs, where two programs in the repository may be solving the same or different tasks. Each instance in our dataset is obtained by taking a pair of programs in the repository and applying a random series of pre-defined code transformations, resulting in (non-)equivalent pairs. Our analysis on this dataset reveals a…
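The construction described in the abstract (take programs from a repository, apply a random series of pre-defined transformations, and label the resulting pair as equivalent or not) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the three transformations (`RenameVariable`, `InsertNoOp`, `SwapAddToSub`), the helper names, and the toy `sum_list` program are hypothetical and are not taken from the paper, whose actual transformation set and source languages are not given in this listing.

```python
import ast
import random


class RenameVariable(ast.NodeTransformer):
    """Hypothetical equivalence-preserving transformation: rename one identifier."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node


class InsertNoOp(ast.NodeTransformer):
    """Hypothetical equivalence-preserving transformation: add a `pass` to each function."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        node.body.insert(0, ast.Pass())
        return node


class SwapAddToSub(ast.NodeTransformer):
    """Hypothetical equivalence-breaking transformation: turn every `+` into `-`."""

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node


def apply_transforms(source, transforms):
    """Apply a series of AST transformations and return the new source text."""
    tree = ast.parse(source)
    for transform in transforms:
        tree = transform.visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


def make_pair(source, equivalent, rng):
    """Build one (original, transformed, label) instance.

    A random non-empty series of preserving transformations is always applied;
    a breaking transformation is appended when a non-equivalent pair is wanted.
    """
    preserving = [RenameVariable("total", "acc"), InsertNoOp()]
    chain = rng.sample(preserving, rng.randint(1, len(preserving)))
    if not equivalent:
        chain.append(SwapAddToSub())
    return source, apply_transforms(source, chain), equivalent


def run(source, func_name, *args):
    """Execute a program's source and call one of its functions."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](*args)


ORIGINAL = """
def sum_list(xs):
    total = 0
    for x in xs:
        total = total + x
    return total
"""

if __name__ == "__main__":
    rng = random.Random(0)
    for wanted in (True, False):
        a, b, label = make_pair(ORIGINAL, wanted, rng)
        same = run(a, "sum_list", [1, 2, 3]) == run(b, "sum_list", [1, 2, 3])
        print(f"label={label} outputs_match={same}")
```

In this sketch the label comes from how the pair was constructed, not from running the programs. Comparing outputs on a handful of inputs can expose non-equivalence, but matching outputs on those inputs do not prove equivalence.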

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Logic, programming, and type systems