CRQBench: A Benchmark of Code Reasoning Questions
Elizabeth Dinella, Satish Chandra, and Petros Maniatis

TL;DR
CRQBench is a new benchmark comprising 100 C++ code reasoning questions derived from code review comments, designed to evaluate the reasoning ability of large language models like GPT-4 more accurately.
Contribution
This paper introduces CRQBench, a novel benchmark for code reasoning, created using an LLM-assisted curation process to better assess model reasoning skills.
Findings
GPT-4 produces correct responses grounded in the given context for 65 of the 100 questions.
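Existing benchmarks are unrealistic and conflate semantic reasoning ability with performance on software engineering tasks; CRQBench instead derives its questions from contextualized code review comments.
Curating the benchmark with an LLM assistant alongside human inspection reduces manual effort.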
Abstract
Large Language Models have demonstrated exceptional proficiency on coding tasks, but it is challenging to precisely evaluate their code reasoning ability. Existing benchmarks are insufficient as they are unrealistic and conflate semantic reasoning ability with performance on software engineering tasks. We introduce CRQBench, a benchmark of 100 C++ code reasoning questions and answers derived from contextualized code review comments. To curate CRQBench, we use an LLM assistant alongside human inspection, reducing manual effort. We conduct an evaluation of GPT-4 on CRQBench and find that it produces correct responses grounded in the given context for 65 of the 100 questions.
Taxonomy
Topics: Software Engineering Research · Business Process Modeling and Analysis
