CRQBench: A Benchmark of Code Reasoning Questions

Elizabeth Dinella, Satish Chandra, and Petros Maniatis

arXiv:2408.08453·cs.SE·August 19, 2024

TL;DR

CRQBench is a new benchmark of 100 C++ code reasoning questions derived from code review comments, designed to evaluate the code reasoning ability of large language models such as GPT-4 more precisely than existing benchmarks do.

Contribution

This paper introduces CRQBench, a novel benchmark for code reasoning, created using an LLM-assisted curation process to better assess model reasoning skills.

Findings

01. GPT-4 correctly answers 65 of the 100 questions.

02. CRQBench addresses limitations of existing benchmarks by focusing on realistic reasoning tasks.

03. The benchmark facilitates more precise evaluation of code reasoning abilities.

Abstract

Large Language Models have demonstrated exceptional proficiency on coding tasks, but it is challenging to precisely evaluate their code reasoning ability. Existing benchmarks are insufficient as they are unrealistic and conflate semantic reasoning ability with performance on software engineering tasks. We introduce CRQBench, a benchmark of 100 C++ code reasoning questions and answers derived from contextualized code review comments. To curate CRQBench, we use an LLM assistant alongside human inspection, reducing manual effort. We conduct an evaluation of GPT-4 on CRQBench and find that it produces correct responses grounded in the given context for 65 of the 100 questions.
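
The reported result is a simple proportion over graded responses. As a minimal sketch of that arithmetic (not the authors' evaluation harness), the snippet below tallies per-question verdicts into an accuracy figure. The `GradedResponse` record and `accuracy` helper are hypothetical names, and the verdicts are placeholders that reproduce only the reported totals (65 correct out of 100); the source does not say which individual questions were answered correctly.

```python
from dataclasses import dataclass


@dataclass
class GradedResponse:
    """Hypothetical record: one model response and its graded verdict."""
    question_id: int
    correct: bool  # True if judged correct and grounded in the given context


def accuracy(graded):
    """Return the fraction of responses marked correct."""
    if not graded:
        raise ValueError("no graded responses")
    return sum(r.correct for r in graded) / len(graded)


# Placeholder verdicts: only the totals (65 of 100) come from the abstract;
# which question ids are marked correct here is arbitrary.
graded = [GradedResponse(question_id=i, correct=(i < 65)) for i in range(100)]

print(f"correct: {sum(r.correct for r in graded)} / {len(graded)}")
print(f"accuracy: {accuracy(graded):.0%}")
```

Keeping one record per question, rather than a bare count, leaves room to attach the grader's rationale or the code context to each verdict if a finer-grained breakdown is wanted.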

Taxonomy

Topics: Software Engineering Research · Business Process Modeling and Analysis