An Agentic Evaluation Framework for AI-Generated Scientific Code in PETSc

Hong Zhang; Barry Smith; Satish Balay; Le Chen; Murat Keceli; Lois Curfman McInnes; Junchao Zhang

arXiv:2603.15976 · cs.AI · March 18, 2026

TL;DR

This paper introduces petscagent-bench, an agentic evaluation framework for AI-generated scientific code in PETSc. It enables black-box assessment of generated HPC code across five scoring categories rather than test-case matching alone.

Contribution

The paper presents an agentic framework for evaluating AI-generated scientific code in HPC. It addresses the limits of traditional test-case benchmarks by scoring correctness, performance, code quality, algorithmic appropriateness, and library-specific conventions.

Findings

01. Current models produce readable code but struggle with library-specific conventions.

02. The framework enables black-box, multi-faceted evaluation of code quality.

03. Traditional benchmarks miss critical aspects like API conventions and performance.

Abstract

While large language models have significantly accelerated scientific code generation, comprehensively evaluating the generated code remains a major challenge. Traditional benchmarks reduce evaluation to test-case matching, an approach insufficient for library code in HPC where solver selection, API conventions, memory management, and performance are just as critical as functional correctness. To address this gap, we introduce petscagent-bench, an agentic framework built on an agents-evaluating-agents paradigm. Instead of relying on static scripts, petscagent-bench deploys a tool-augmented evaluator agent that compiles, executes, and measures code produced by a separate model-under-test agent, orchestrating a 14-evaluator pipeline across five scoring categories: correctness, performance, code quality, algorithmic appropriateness, and library-specific conventions. Because the agents…
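To make the idea of many evaluators feeding a small set of scoring categories concrete, here is a minimal Python sketch. It is not the petscagent-bench implementation: the evaluator names, the [0, 1] score range, and the unweighted per-category mean are all assumptions chosen for illustration. Only the five category names come from the abstract.

```python
from dataclasses import dataclass
from statistics import mean

# The five scoring categories named in the abstract.
CATEGORIES = (
    "correctness",
    "performance",
    "code_quality",
    "algorithmic_appropriateness",
    "library_conventions",
)


@dataclass(frozen=True)
class EvaluatorResult:
    evaluator: str  # hypothetical evaluator name, not taken from the paper
    category: str   # must be one of CATEGORIES
    score: float    # assumed to be normalized to the range [0, 1]


def aggregate(results):
    """Average evaluator scores within each category.

    The unweighted mean is an assumption; the paper may weight or
    combine evaluator outputs differently.
    """
    by_category = {c: [] for c in CATEGORIES}
    for r in results:
        if r.category not in by_category:
            raise ValueError(f"unknown category: {r.category}")
        if not 0.0 <= r.score <= 1.0:
            raise ValueError(f"score out of range for {r.evaluator}: {r.score}")
        by_category[r.category].append(r.score)
    # A category with no results is reported as None rather than as 0,
    # so "not evaluated" is never confused with "failed".
    return {c: (mean(s) if s else None) for c, s in by_category.items()}


# Hypothetical results for one generated PETSc program.
results = [
    EvaluatorResult("compiles", "correctness", 1.0),
    EvaluatorResult("residual_check", "correctness", 0.5),
    EvaluatorResult("wall_time", "performance", 0.75),
    EvaluatorResult("error_checking_macros", "library_conventions", 0.25),
]
summary = aggregate(results)
```

Keeping each evaluator's output as a separate record, rather than a single pass/fail flag, is what lets a report distinguish code that runs correctly from code that also follows library conventions.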

Taxonomy

Topics: Scientific Computing and Data Management · Model-Driven Software Engineering Techniques · Software Engineering Research