Ensuring Reproducibility in Generative AI Systems for General Use Cases: A Framework for Regression Testing and Open Datasets
Masumi Morishige, Ryo Koshihara

TL;DR
This paper presents GPR-bench, a benchmark for testing the reproducibility of generative AI systems across multiple tasks and languages, using automated evaluation to detect model drift and assess prompt engineering effects.
Contribution
It introduces GPR-bench, an open, multilingual benchmark with an automated scoring pipeline for regression testing of generative AI models in general use cases.
Findings
Newer models show modest correctness improvements.
A concise-writing instruction in the prompt significantly improves conciseness.
The benchmark reveals minimal accuracy degradation when the prompt changes.
Abstract
Reproducibility and reliability remain pressing challenges for generative AI systems whose behavior can drift with each model update or prompt revision. We introduce GPR-bench, a lightweight, extensible benchmark that operationalizes regression testing for general-purpose use cases. GPR-bench couples an open, bilingual (English and Japanese) dataset covering eight task categories (e.g., text generation, code generation, and information retrieval) and 10 scenarios in each task category (80 total test cases for each language) with an automated evaluation pipeline that employs "LLM-as-a-Judge" scoring of correctness and conciseness. Experiments across three recent model versions (gpt-4o-mini, o3-mini, and o4-mini) and two prompt configurations (default versus concise-writing instruction) reveal heterogeneous quality. Our results show that newer models generally improve correctness, but…
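The regression-testing idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names (`run_regression`, `find_regressions`), the 0-100 score scale, and the rule-based `toy_judge` standing in for an LLM judge are all assumptions made for illustration.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]  # metric name -> score (0-100 scale is an assumption)


def run_regression(
    cases: List[Tuple[str, str]],           # hypothetical (category, prompt) pairs
    generate: Callable[[str], str],         # system under test: prompt -> output
    judge: Callable[[str, str], Scores],    # scorer: (prompt, output) -> scores
) -> Scores:
    """Return the mean score per metric over all test cases."""
    per_metric: Dict[str, List[float]] = {}
    for _category, prompt in cases:
        output = generate(prompt)
        for metric, value in judge(prompt, output).items():
            per_metric.setdefault(metric, []).append(value)
    return {metric: mean(values) for metric, values in per_metric.items()}


def find_regressions(baseline: Scores, candidate: Scores,
                     tolerance: float = 0.0) -> Scores:
    """Return metrics whose mean dropped by more than `tolerance`, with the delta."""
    return {
        metric: candidate[metric] - baseline[metric]
        for metric in baseline
        if metric in candidate
        and candidate[metric] - baseline[metric] < -tolerance
    }


def toy_judge(prompt: str, output: str) -> Scores:
    """Rule-based placeholder for an LLM judge (assumption, for the demo only)."""
    return {
        "correctness": 100.0 if "4" in output else 0.0,
        "conciseness": 100.0 if len(output.split()) <= 5 else 50.0,
    }


cases = [("arithmetic", "What is 2 + 2?")]
verbose_scores = run_regression(
    cases, lambda p: "The answer to your question is 4, as requested.", toy_judge)
concise_scores = run_regression(cases, lambda p: "4", toy_judge)

# Moving from the verbose system to the concise one loses nothing.
no_drop = find_regressions(verbose_scores, concise_scores)
# Moving the other way flags a conciseness regression.
drop = find_regressions(concise_scores, verbose_scores)
```

In practice `generate` would call a specific model version with a specific prompt configuration, and `judge` would call a judge model; keeping both as plain callables lets the same comparison run across any pair of configurations.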
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification
