SAGE: Scalable AI Governance & Evaluation

Benjamin Le; Xueying Lu; Nick Stern; Wenqiong Liu; Igor Lapchuk; Xiang Li; Baofen Zheng; Kevin Rosenberg; Jiewen Huang; Zhe Zhang; Abraham Cabangbang; Satej Milind Wagle; Jianqiang Shen; Raghavan Muthuregunathan; Abhinav Gupta; Mathew Teoh; Andrew Kirk; Thomas Kwan; Jingwei Wu; and Wenjing Zhang

arXiv:2602.07840 · cs.IR · February 11, 2026

Open Access

TL;DR

SAGE is a scalable framework that enhances AI governance and relevance evaluation in large-scale search systems by combining human judgment, language models, and distillation techniques, leading to improved model oversight and user engagement.

Contribution

SAGE introduces a novel calibration loop integrating human policies, precedent, and language models to produce high-quality, scalable relevance judgments for AI systems.

Findings

01 Achieved near human-level agreement in relevance judgments.

02 Cut evaluation costs 92-fold through distillation.

03 Enabled regression detection and policy oversight in production.
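
As a rough illustration of how agreement between a surrogate judge and human raters might be quantified, the sketch below computes Cohen's kappa over paired labels. The source does not state which agreement statistic SAGE reports, so the choice of kappa, the function name `cohen_kappa`, and the label values are all assumptions made for this example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items.

    Hypothetical helper for illustration; not taken from the SAGE paper.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    # Both raters used a single identical label everywhere: treat as perfect.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Made-up labels: four query-document pairs rated by a human and by a judge.
human = ["relevant", "relevant", "irrelevant", "irrelevant"]
judge = ["relevant", "irrelevant", "irrelevant", "irrelevant"]
print(cohen_kappa(human, judge))  # 0.5
```

In this framing, "near human-level" would mean the judge-to-human score approaches the score that two human raters reach with each other on the same items.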

Abstract

Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-constrained human oversight and the high-throughput requirements of production systems. While traditional approaches rely on engagement proxies or sparse manual review, these methods often fail to capture the full scope of high-impact relevance failures. We present SAGE (Scalable AI Governance & Evaluation), a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. At the core of SAGE is a bidirectional calibration loop where natural-language Policy, curated Precedent, and an LLM Surrogate Judge co-evolve. SAGE systematically resolves semantic ambiguities and misalignments, transforming subjective relevance judgment into an executable, multi-dimensional rubric with near human-level…

Taxonomy

Topics: Ethics and Social Impacts of AI · Advanced Graph Neural Networks · Multimodal Machine Learning Applications