Counting Without Running: Evaluating LLMs' Reasoning About Code Complexity

Gregory Bolet; Giorgis Georgakoudis; Konstantinos Parasyris; Harshitha Menon; Niranjan Hasabnis; Kirk W. Cameron; Gal Oren

arXiv:2512.04355·cs.DC·December 5, 2025

Open Access

TL;DR

This paper introduces gpuFLOPBench, a benchmark for evaluating LLMs' ability to predict GPU kernel FLOP counts, revealing current models' limitations in reasoning about implicit hardware-specific behaviors.

Contribution

The paper contributes gpuFLOPBench, a benchmark for assessing how well LLMs reason about GPU code complexity and predict kernel performance.

Findings

1. Current LLMs excel at simple FLOP counting but struggle with kernels whose FLOPs depend on implicit behaviors.

2. Prediction errors on implicit FLOPs can reach several orders of magnitude.

3. The benchmark highlights core limitations in how LLMs reason about hardware-specific effects.
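
To make "several orders of magnitude" concrete, here is a minimal sketch of one way such an error could be expressed. The function name and the choice of an absolute log10 ratio are assumptions made for illustration; the page does not state which error metric the paper uses.

```python
import math


def orders_of_magnitude_off(predicted: float, actual: float) -> float:
    """Absolute log10 gap between a predicted and a profiled FLOP count.

    Hypothetical helper, not the paper's metric. Both counts must be
    positive, so kernels with zero FLOPs need separate handling.
    """
    if predicted <= 0 or actual <= 0:
        raise ValueError("both counts must be positive for a log ratio")
    return abs(math.log10(predicted) - math.log10(actual))


# An exact prediction is zero orders of magnitude off.
exact = orders_of_magnitude_off(2_000_000, 2_000_000)

# Predicting 1e3 FLOPs for a kernel that executes 1e6 is three orders off.
far = orders_of_magnitude_off(1_000, 1_000_000)
```

A log-scale measure like this treats over- and under-prediction symmetrically, which is convenient when counts span many magnitudes.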

Abstract

Modern GPU software stacks demand developers who can anticipate performance bottlenecks before ever launching a kernel; misjudging floating-point workloads upstream can derail tuning, scheduling, and even hardware procurement. Yet despite rapid progress in code generation, today's Large Language Models (LLMs) are rarely tested on this kind of forward-looking reasoning. We close that gap with gpuFLOPBench, a benchmark that asks models to "count without running" by predicting single- and double-precision FLOP counts for 577 CUDA kernels drawn from HeCBench, annotated with ground-truth profiles and eight execution attributes that distinguish trivially analyzable code from kernels whose FLOPs depend on hidden compiler or runtime behavior. Evaluating current closed-source reasoning models shows clear but uneven progress: the newest LLMs achieve perfect classification on straightforward…
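
As an illustration of what "counting without running" could look like for a trivially analyzable kernel, the sketch below derives a single-precision FLOP count from the source and launch configuration alone. The SAXPY kernel, the function name, and the launch parameters are hypothetical and are not taken from the benchmark; the sketch also assumes the common convention that one multiply plus one add (or one fused multiply-add) counts as 2 FLOPs.

```python
# Hypothetical CUDA kernel being analyzed (shown for reference only):
#
#   __global__ void saxpy(int n, float a, const float *x, float *y) {
#       int i = blockIdx.x * blockDim.x + threadIdx.x;
#       if (i < n) y[i] = a * x[i] + y[i];
#   }


def saxpy_flops(n: int, grid_dim: int, block_dim: int) -> int:
    """Static single-precision FLOP estimate for the kernel above."""
    launched_threads = grid_dim * block_dim
    # Only threads that pass the `i < n` guard do arithmetic, and the
    # kernel has no grid-stride loop, so each thread handles one element.
    active_threads = min(n, launched_threads)
    # One multiply and one add per active thread.
    return 2 * active_threads


# Enough threads to cover every element: 2 FLOPs per element.
covered = saxpy_flops(n=1000, grid_dim=4, block_dim=256)

# Under-provisioned launch: only 512 elements are ever touched.
partial = saxpy_flops(n=1000, grid_dim=2, block_dim=256)
```

A count like this holds only while every operation in the source maps directly to the arithmetic that actually executes. The kernels the abstract describes as depending on hidden compiler or runtime behavior are the ones where this kind of source-level tally can diverge from the profiled count.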

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Logic, programming, and type systems · Big Data and Digital Economy