Predicting the Performance of Black-box LLMs through Follow-up Queries
Dylan Sam, Marc Finzi, J. Zico Kolter

TL;DR
This paper predicts the behavior of black-box language models by asking follow-up questions and using the probabilities of the responses as features. The resulting predictors can estimate whether an answer is correct, detect adversarial manipulation, and identify which model is behind an API.
Contribution
It presents a novel approach that leverages follow-up question responses to reliably predict model correctness and detect adversarial or misrepresented models in black-box settings.
Findings
Linear models on follow-up responses predict correctness accurately.
Follow-up responses distinguish between clean and adversarially manipulated models.
The method can outperform white-box linear predictors that use model internals, and it detects model misrepresentation.
Abstract
Reliably predicting the behavior of language models, such as whether their outputs are correct or have been adversarially manipulated, is a fundamentally challenging task. This is often made even more difficult as frontier language models are offered only through closed-source APIs, providing only black-box access. In this paper, we predict the behavior of black-box language models by asking follow-up questions and taking the probabilities of responses as representations to train reliable predictors. We first demonstrate that training a linear model on these responses reliably and accurately predicts model correctness on question-answering and reasoning benchmarks. Surprisingly, this can even outperform white-box linear predictors that operate over model internals or activations. Furthermore, we demonstrate that these follow-up question responses can reliably…
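The core idea in the abstract, a linear model trained on follow-up response probabilities, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each example is represented by the probability that the model gives an affirmative reply to each of four follow-up questions, and it replaces real API calls with a hypothetical generator, `make_synthetic_example`, whose numbers are invented purely for illustration. The linear model here is a plain logistic regression fitted by gradient descent; all function names and hyperparameters are assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, lr=1.0, epochs=500):
    """Fit a logistic regression (weights w, bias b) by batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_correct(w, b, x):
    """Predict whether the original answer was correct from follow-up features."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5


def make_synthetic_example(rng, n_followups=4):
    """Hypothetical stand-in for querying a black-box model.

    Assumption for illustration only: when the original answer is correct,
    the model tends to assign higher probability to an affirmative reply
    to each follow-up question. These values are not from the paper.
    """
    correct = rng.random() < 0.5
    center = 0.75 if correct else 0.35
    feats = [min(1.0, max(0.0, rng.gauss(center, 0.15)))
             for _ in range(n_followups)]
    return feats, int(correct)


rng = random.Random(0)
data = [make_synthetic_example(rng) for _ in range(400)]
train, test = data[:300], data[300:]

w, b = train_linear_probe([x for x, _ in train], [y for _, y in train])
accuracy = sum(predict_correct(w, b, x) == bool(y) for x, y in test) / len(test)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

In a real setting, each feature vector would come from the probabilities an API reports for the model's replies to a fixed set of follow-up prompts, and the labels would come from a benchmark's ground-truth answers. The accuracy printed here reflects only how separable the synthetic data is and says nothing about the paper's results.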
Taxonomy
Topics: Imbalanced Data Classification Techniques · Data Mining Algorithms and Applications
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Multi-Head Attention · Layer Normalization · Byte Pair Encoding · Linear Warmup With Cosine Annealing · Dense Connections
