GRAFITE: Generative Regression Analysis Framework for Issue Tracking and Evaluation

Ja Young Lee; Mírian Silva; Mohamed Nasr; Shonda Witherspoon; Enzo Bozzani; Veronique Demers; Radha Ratnaparkhi; Hui Wu; Sara Rosenthal

arXiv:2603.18173·cs.CL·March 20, 2026

TL;DR

GRAFITE is a platform for continuous evaluation of large language models, addressing data contamination issues and enabling regression detection through user feedback and QA testing.

Contribution

It introduces a comprehensive system for ongoing LLM evaluation using user feedback, QA tests, and model comparison to detect performance regressions over time.

Findings

01. Enables side-by-side comparison of multiple LLMs

02. Detects performance regressions across model releases

03. Builds a repository of model issues from user feedback

Abstract

Large language models (LLMs) are largely motivated by their performance on popular topics and benchmarks at the time of their release. However, over time, contamination occurs due to significant exposure of benchmark data during training. This poses a risk of model performance inflation if testing is not carefully executed. To address this challenge, we present GRAFITE, a continuous LLM evaluation platform through a comprehensive system for maintaining and evaluating model issues. Our approach enables building a repository of model problems based on user feedback over time and offers a pipeline for assessing LLMs against these issues through quality assurance (QA) tests using LLM-as-a-judge. The platform enables side-by-side comparison of multiple models, facilitating regression detection across different releases. The platform is available at https://github.com/IBM/grafite. The demo…
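The workflow described in the abstract (a repository of reported issues, QA tests scored by a judge, and a comparison across releases) can be illustrated with a minimal sketch. This is not GRAFITE's actual API: the names `QATest`, `run_tests`, `find_regressions`, and `keyword_judge` are hypothetical, the two "models" are lookup tables, and the judge is a simple keyword check standing in for an LLM-as-a-judge call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class QATest:
    """One QA test attached to a reported issue (hypothetical schema)."""
    issue_id: str     # identifier of the user-reported issue
    prompt: str       # input sent to the model under test
    expectation: str  # criterion handed to the judge


# Assumed interfaces: a model maps a prompt to an answer; a judge
# returns True when the answer satisfies the expectation.
Model = Callable[[str], str]
Judge = Callable[[str, str, str], bool]


def run_tests(model: Model, judge: Judge, tests: List[QATest]) -> Dict[str, bool]:
    """Run each test against one model and record the judge's verdict.

    Assumes one test per issue, so issue_id is a unique key.
    """
    return {t.issue_id: judge(t.prompt, model(t.prompt), t.expectation) for t in tests}


def find_regressions(old: Dict[str, bool], new: Dict[str, bool]) -> List[str]:
    """Return issues that passed on the old release but fail on the new one."""
    return sorted(i for i, passed in old.items() if passed and not new.get(i, False))


def keyword_judge(prompt: str, answer: str, expectation: str) -> bool:
    """Toy stand-in for an LLM judge: pass if the expected text appears."""
    return expectation.lower() in answer.lower()


tests = [
    QATest("ISSUE-1", "What is the capital of France?", "Paris"),
    QATest("ISSUE-2", "What is 2 + 2?", "4"),
]

# Two toy "releases" implemented as canned answers.
answers_v1 = {"What is the capital of France?": "Paris.", "What is 2 + 2?": "4"}
answers_v2 = {"What is the capital of France?": "Lyon.", "What is 2 + 2?": "4"}


def model_v1(prompt: str) -> str:
    return answers_v1.get(prompt, "")


def model_v2(prompt: str) -> str:
    return answers_v2.get(prompt, "")


results_v1 = run_tests(model_v1, keyword_judge, tests)
results_v2 = run_tests(model_v2, keyword_judge, tests)
regressions = find_regressions(results_v1, results_v2)
print(regressions)
```

In this toy run the second release breaks an issue the first one handled, so `regressions` contains `ISSUE-1`. In a real setup the judge would be a call to a separate LLM and the verdicts would be stored per release so that they can be compared side by side.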

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Explainable Artificial Intelligence (XAI)