Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier
Jazmia Henry

TL;DR
This paper introduces the Grounded Continuous Evaluation (GCE) framework and ISOPro, its reference implementation. Together they target four systematic failures of current LLM evaluation (distributional, temporal, scope, and process invalidity) and aim to make evaluation more reliable and reproducible.
Contribution
It proposes ISOPro, which replaces the learned reward model with a deterministic verifier, and validates it across three architectures and multiple domains.
Findings
ISOPro eliminates reward hacking by construction in verifiable-reward domains.
ISOPro achieves larger capability gains than GRPO-LoRA.
ISOPro updates LoRA adapters on CPU, reducing the hardware barrier by an order of magnitude and making evaluation easier to reproduce.
Abstract
We argue that current evaluation frameworks for large language models (LLMs) suffer from four systematic failures that make them structurally inadequate for deployed, agentic systems: distributional, temporal, scope, and process invalidity. These failures compound in RLHF, making reward hacking a predictable consequence of evaluation design rather than an unpredictable training pathology, and RLHF's dual-model architecture imposes a hardware barrier limiting evaluation reproducibility. We propose the Grounded Continuous Evaluation (GCE) framework and present ISOPro as a reference implementation. ISOPro replaces the learned reward model with a deterministic verifier, eliminating reward hacking by construction in verifiable-reward domains, and updates LoRA adapters on CPU, reducing the hardware barrier by an order of magnitude. We validate ISOPro across three architectures (Qwen 2.5 3B,…
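To make the idea of a deterministic verifier concrete, here is a minimal sketch, assuming a verifiable-reward domain where each prompt has a single exact numeric answer. It is not ISOPro's actual code or API: the function name `verify_numeric_answer`, the "last number in the completion" convention, and the binary 0/1 reward are all assumptions made for illustration. The point it shows is that the reward is computed by a fixed rule rather than predicted by a learned model, so there are no learned reward parameters for the policy to exploit.

```python
import re
from fractions import Fraction

# Hypothetical verifier, for illustration only (not part of ISOPro).
# Assumption: the model's final answer is the last number in its completion.
NUMBER_PATTERN = re.compile(r"-?\d+(?:\.\d+)?")


def verify_numeric_answer(completion: str, expected: str) -> float:
    """Return 1.0 if the last number in `completion` equals `expected`, else 0.0.

    The check is a pure function of its inputs: the same completion and
    reference always produce the same reward.
    """
    numbers = NUMBER_PATTERN.findall(completion)
    if not numbers:
        return 0.0  # no parseable answer, so no reward
    try:
        # Fraction gives exact comparison, so "0.50" matches "0.5".
        return 1.0 if Fraction(numbers[-1]) == Fraction(expected) else 0.0
    except ValueError:
        return 0.0  # malformed reference answer


if __name__ == "__main__":
    print(verify_numeric_answer("So the total is 42.", "42"))
    print(verify_numeric_answer("I believe it is 41", "42"))
```

A rule this strict can still mis-score a correct answer that is formatted unexpectedly, so a real verifier would need a more careful answer-extraction convention for its domain.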
