BoRP: Bootstrapped Regression Probing for Scalable and Human-Aligned LLM Evaluation

Peng Sun; Xiangyu Zhang; Duan Wu

arXiv:2601.18253·cs.CL·January 27, 2026

TL;DR

BoRP is a scalable, high-fidelity evaluation framework for conversational AI that uses latent space properties and bootstrapping to reliably measure user satisfaction, outperforming generative baselines and reducing costs.

Contribution

Introduces BoRP, a novel latent space-based bootstrapped regression probing method for scalable, human-aligned satisfaction evaluation of large language models.

Findings

01  BoRP aligns with human judgments more closely than generative baselines, including Qwen3-Max.

02  BoRP reduces inference costs by orders of magnitude.

03  BoRP enables sensitive A/B testing with full-scale monitoring.

Abstract

Accurate evaluation of user satisfaction is critical for iterative development of conversational AI. However, for open-ended assistants, traditional A/B testing lacks reliable metrics: explicit feedback is sparse, while implicit metrics are ambiguous. To bridge this gap, we introduce BoRP (Bootstrapped Regression Probing), a scalable framework for high-fidelity satisfaction evaluation. Unlike generative approaches, BoRP leverages the geometric properties of LLM latent space. It employs a polarization-index-based bootstrapping mechanism to automate rubric generation and utilizes Partial Least Squares (PLS) to map hidden states to continuous scores. Experiments on industrial datasets show that BoRP (Qwen3-8B/14B) significantly outperforms generative baselines (even Qwen3-Max) in alignment with human judgments. Furthermore, BoRP reduces inference costs by orders of magnitude, enabling…
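The abstract's regression step, mapping hidden states to continuous scores with Partial Least Squares, might look roughly like the sketch below. It is a minimal single-response PLS (NIPALS with deflation) in plain Python, not the authors' implementation. The function names `fit_pls1` and `predict_pls1` are hypothetical, and the "hidden states" are random synthetic vectors rather than Qwen3 activations. The abstract does not specify the layer, pooling, number of components, or PLS variant, so all of those are assumptions here.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def fit_pls1(X, y, n_components):
    """Fit single-response PLS via NIPALS with deflation (illustrative only)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Work on centered copies; both are deflated after each component.
    Xr = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yr = [v - y_mean for v in y]
    components = []
    for _ in range(n_components):
        # Weight vector: feature-space direction with maximal covariance with y.
        w = [sum(Xr[i][j] * yr[i] for i in range(n)) for j in range(p)]
        norm = dot(w, w) ** 0.5
        if norm < 1e-12:  # nothing left in y to explain
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in Xr]  # latent scores, one per sample
        tt = dot(t, t)
        if tt < 1e-12:
            break
        p_load = [sum(Xr[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = dot(yr, t) / tt  # regression of y on the latent score
        # Deflate X and y by the part explained by this component.
        Xr = [[Xr[i][j] - t[i] * p_load[j] for j in range(p)] for i in range(n)]
        yr = [yr[i] - q * t[i] for i in range(n)]
        components.append((w, p_load, q))
    return {"x_mean": x_mean, "y_mean": y_mean, "components": components}


def predict_pls1(model, x):
    """Score one new vector by replaying the same projections and deflations."""
    xr = [v - m for v, m in zip(x, model["x_mean"])]
    score = model["y_mean"]
    for w, p_load, q in model["components"]:
        t = dot(xr, w)
        score += q * t
        xr = [v - t * pl for v, pl in zip(xr, p_load)]
    return score


# Synthetic stand-in data (assumption): 4-dimensional "hidden states" whose
# "satisfaction score" is an exact linear function of the features.
random.seed(0)
true_w = [0.5, -1.0, 2.0, 0.0]
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(30)]
y = [3.0 + dot(row, true_w) for row in X]

model = fit_pls1(X, y, n_components=4)
x_new = [1.0, 2.0, -1.0, 0.5]
pred = predict_pls1(model, x_new)
expected = 3.0 + dot(x_new, true_w)
```

On this noise-free toy data, using as many components as features makes PLS coincide with ordinary least squares, so `pred` should match `expected` up to floating-point error. With real hidden states, which have thousands of correlated dimensions, one would typically keep far fewer components than features; that choice is a tuning decision the abstract does not describe.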

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning