EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking

Anjiang Wei; Jiannan Cao; Ran Li; Hongyu Chen; Yuhui Zhang; Ziheng Wang; Yuan Liu; Thiago S. F. X. Teixeira; Diyi Yang; Ke Wang; Alex Aiken

arXiv:2502.12466 · cs.LG · September 23, 2025

TL;DR

EquiBench is a new benchmark designed to evaluate large language models' ability to understand program semantics through equivalence checking, revealing current models' limitations in semantic reasoning.

Contribution

We introduce EquiBench, a comprehensive benchmark with high-confidence program pairs for assessing LLMs' semantic reasoning in code.

Findings

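01. State-of-the-art LLMs achieve only modest accuracy on EquiBench: in the most challenging categories, the best accuracies are 63.8% and 76.2%, against a 50% random baseline.

02. Models often rely on syntactic cues rather than semantic understanding.

03. Current models show significant room for improvement in reasoning about program semantics.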

Abstract

As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs. Unlike prior code generation benchmarks, this task directly tests a model's ability to reason about program semantics. EquiBench consists of 2400 program pairs across four languages and six categories. These pairs are generated through program analysis, compiler scheduling, and superoptimization, ensuring high-confidence labels, nontrivial difficulty, and full automation. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline. Further analysis reveals…
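To make the task concrete, the following is a minimal sketch of what an equivalence-checking pair looks like. The functions (`sum_loop`, `sum_formula`, `sum_off_by_one`) and the helper `find_counterexample` are hypothetical illustrations, not pairs from EquiBench or the authors' tooling. The sketch also shows why labels cannot come from testing alone: running a finite set of inputs can refute equivalence by finding a counterexample, but it cannot establish equivalence over all possible inputs.

```python
from typing import Callable, Iterable, Optional


def sum_loop(n: int) -> int:
    """Sum the integers 0..n by iteration."""
    total = 0
    for i in range(n + 1):
        total += i
    return total


def sum_formula(n: int) -> int:
    """Closed form: looks different from sum_loop, same result for n >= 0."""
    return n * (n + 1) // 2


def sum_off_by_one(n: int) -> int:
    """Looks almost identical to sum_loop, but stops at n - 1."""
    total = 0
    for i in range(n):
        total += i
    return total


def find_counterexample(
    f: Callable[[int], int],
    g: Callable[[int], int],
    inputs: Iterable[int],
) -> Optional[int]:
    """Return the first tested input where f and g disagree, else None.

    None only means no disagreement was found on the tested inputs;
    it is not a proof of equivalence.
    """
    for x in inputs:
        if f(x) != g(x):
            return x
    return None


# Assumed input domain for this illustration: non-negative integers below 200.
domain = range(200)
no_diff = find_counterexample(sum_loop, sum_formula, domain)
diff_at = find_counterexample(sum_loop, sum_off_by_one, domain)
```

Here `sum_loop` and `sum_formula` share little surface syntax yet agree on every tested input, while `sum_loop` and `sum_off_by_one` differ by a single token and disagree at `n = 1`. A model that leans on syntactic similarity would tend to get both pairs wrong.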

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies