Ensuring Reproducibility in Generative AI Systems for General Use Cases: A Framework for Regression Testing and Open Datasets

Masumi Morishige; Ryo Koshihara

arXiv:2505.02854·cs.CL·May 7, 2025

TL;DR

This paper presents GPR-bench, a benchmark for testing the reproducibility of generative AI systems across multiple tasks and languages, using automated evaluation to detect model drift and assess prompt engineering effects.

Contribution

It introduces GPR-bench, an open, multilingual benchmark with an automated scoring pipeline for regression testing of generative AI models in general use cases.

Findings

01. Newer models show modest correctness improvements.

02. Concise prompts significantly improve conciseness.

03. Benchmark reveals minimal accuracy degradation with prompt changes.

Abstract

Reproducibility and reliability remain pressing challenges for generative AI systems, whose behavior can drift with each model update or prompt revision. We introduce GPR-bench, a lightweight, extensible benchmark that operationalizes regression testing for general-purpose use cases. GPR-bench couples an open, bilingual (English and Japanese) dataset with an automated evaluation pipeline that employs "LLM-as-a-Judge" scoring of correctness and conciseness. The dataset covers eight task categories (e.g., text generation, code generation, and information retrieval) with 10 scenarios per category, for 80 test cases in each language. Experiments across three recent model versions (gpt-4o-mini, o3-mini, and o4-mini) and two prompt configurations (default versus concise-writing instruction) reveal heterogeneous quality. Our results show that newer models generally improve correctness, but…
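The regression-testing idea in the abstract can be illustrated with a minimal Python sketch. It is not the GPR-bench implementation: the names `Case`, `evaluate`, `regressions`, and `toy_judge` are hypothetical, the 0 to 100 score scale is an assumption, and the judge is a deterministic stand-in for a real "LLM-as-a-Judge" call. The sketch scores each response on correctness and conciseness, averages per metric, and reports any metric on which a candidate configuration falls below the baseline.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Case:
    """One benchmark scenario (hypothetical structure, not the GPR-bench schema)."""
    category: str
    prompt: str


# Assumed signatures: a model maps a prompt to a response, and a judge maps
# (prompt, response) to per-metric scores on an assumed 0-100 scale.
Model = Callable[[str], str]
Judge = Callable[[str, str], Dict[str, float]]


def evaluate(model: Model, judge: Judge, cases: List[Case]) -> Dict[str, float]:
    """Run every case through the model and return the mean score per metric."""
    scores = [judge(c.prompt, model(c.prompt)) for c in cases]
    return {m: mean(s[m] for s in scores) for m in scores[0]}


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                tolerance: float = 0.0) -> Dict[str, float]:
    """Return the score change for each metric that dropped by more than `tolerance`."""
    return {m: candidate[m] - baseline[m]
            for m in baseline
            if candidate[m] < baseline[m] - tolerance}


def toy_judge(prompt: str, response: str) -> Dict[str, float]:
    """Deterministic stand-in for an LLM judge, for illustration only."""
    return {
        "correctness": 100.0 if "4" in response else 0.0,
        "conciseness": max(0.0, 100.0 - len(response)),
    }


cases = [Case("arithmetic", "What is 2 + 2?"), Case("arithmetic", "What is 1 + 3?")]
terse_model: Model = lambda prompt: "4"
verbose_model: Model = lambda prompt: "The answer to your question is 4."

baseline = evaluate(terse_model, toy_judge, cases)
candidate = evaluate(verbose_model, toy_judge, cases)
print(regressions(baseline, candidate))
```

In this toy run both configurations are equally correct, so only the conciseness metric is flagged. A real pipeline would replace `toy_judge` with a call to a judge model and would likely set a non-zero `tolerance` to absorb scoring noise.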

Code & Models

Repositories

galirage/gpr-bench (official)

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification