BoRP: Bootstrapped Regression Probing for Scalable and Human-Aligned LLM Evaluation
Peng Sun, Xiangyu Zhang, Duan Wu

TL;DR
BoRP is a scalable, high-fidelity evaluation framework for conversational AI that uses latent space properties and bootstrapping to reliably measure user satisfaction, outperforming generative baselines and reducing costs.
Contribution
Introduces BoRP, a novel latent space-based bootstrapped regression probing method for scalable, human-aligned satisfaction evaluation of large language models.
Findings
BoRP (built on Qwen3-8B/14B) aligns with human judgments better than generative baselines, including Qwen3-Max.
It reduces inference costs by orders of magnitude.
It enables full-scale monitoring and sensitive A/B testing.
Abstract
Accurate evaluation of user satisfaction is critical for iterative development of conversational AI. However, for open-ended assistants, traditional A/B testing lacks reliable metrics: explicit feedback is sparse, while implicit metrics are ambiguous. To bridge this gap, we introduce BoRP (Bootstrapped Regression Probing), a scalable framework for high-fidelity satisfaction evaluation. Unlike generative approaches, BoRP leverages the geometric properties of LLM latent space. It employs a polarization-index-based bootstrapping mechanism to automate rubric generation and utilizes Partial Least Squares (PLS) to map hidden states to continuous scores. Experiments on industrial datasets show that BoRP (Qwen3-8B/14B) significantly outperforms generative baselines (even Qwen3-Max) in alignment with human judgments. Furthermore, BoRP reduces inference costs by orders of magnitude, enabling…
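To make the "map hidden states to continuous scores" step concrete, here is a minimal sketch of a single-component PLS regression in plain Python. It is not the authors' implementation: the random vectors stand in for real LLM hidden states, the function names (`fit_pls1_one_component`, `predict`, `pearson`) are hypothetical, and the polarization-index bootstrapping step is not modeled because the summary does not define it.

```python
import random


def fit_pls1_one_component(X, y):
    """Fit a one-component PLS1 model: project X onto the direction of
    maximal covariance with y, then regress y on that projection."""
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector w is proportional to Xc^T yc, normalized to unit length.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(d)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Latent scores t = Xc w, then a scalar regression of yc on t.
    t = [sum(Xc[i][j] * w[j] for j in range(d)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return x_mean, y_mean, w, q


def predict(model, x):
    """Map one feature vector to a continuous score."""
    x_mean, y_mean, w, q = model
    t = sum((xj - mj) * wj for xj, mj, wj in zip(x, x_mean, w))
    return y_mean + q * t


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


# Synthetic stand-in data (assumption): 300 "hidden states" of dimension 8,
# with a satisfaction score that is linear in the state plus small noise.
random.seed(0)
dim, n_samples = 8, 300
beta = [random.gauss(0, 1) for _ in range(dim)]
hidden_states = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_samples)]
scores = [
    sum(b * h for b, h in zip(beta, row)) + random.gauss(0, 0.1)
    for row in hidden_states
]

model = fit_pls1_one_component(hidden_states, scores)
preds = [predict(model, row) for row in hidden_states]
print("in-sample Pearson r:", round(pearson(preds, scores), 3))
```

In practice one would likely use several PLS components, real hidden states extracted from a chosen layer, and held-out data for evaluation; the single component here only keeps the arithmetic easy to follow.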
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
