RPRA: Predicting an LLM-Judge for Efficient but Performant Inference
Dylan R. Ashley, Gaël Le Lan, Changsheng Zhao, Naina Dhingra, Zhipeng Cai, Ernie Chang, Mingchen Zhuge, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber

TL;DR
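This paper explores how smaller language models can predict, before responding, how an LLM judge would score their output, so that they answer on their own when they expect a good score and defer to larger models otherwise. The goal is efficient but high-quality inference on computationally limited devices.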
Contribution
It introduces the RPRA paradigm and evaluates approaches for models to predict how an LLM judge would score their output, improving smaller models' ability to assess their own responses.
Findings
Larger models perform well as judges in zero-shot settings.
Smaller models predict judge scores significantly more accurately after fine-tuning or when given report cards.
Abstract
Large language models (LLMs) face a fundamental trade-off between computational efficiency (e.g., number of parameters) and output quality, especially when deployed on computationally limited devices such as phones or laptops. One way to address this challenge is to follow the example of humans and have models ask for help when they believe they are incapable of solving a problem on their own; we can overcome this trade-off by allowing smaller models to respond to queries when they believe they can provide good responses, and deferring to larger models when they do not believe they can. To this end, in this paper, we investigate the viability of Predict-Answer/Act (PA) and Reason-Predict-Reason-Answer/Act (RPRA) paradigms where models predict, prior to responding, how an LLM judge would score their output. We evaluate three approaches: zero-shot prediction, prediction using an…
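The deferral idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `route`, `predict_score`, `small_answer`, and `large_answer` are hypothetical, and the 0 to 10 score scale and the threshold of 7.0 are assumptions chosen only for illustration. The toy functions stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoutedResponse:
    text: str
    answered_by: str        # "small" or "large"
    predicted_score: float  # the small model's guess at the judge's score


def route(
    query: str,
    predict_score: Callable[[str], float],
    small_answer: Callable[[str], str],
    large_answer: Callable[[str], str],
    threshold: float = 7.0,  # assumed cutoff on an assumed 0-10 judge scale
) -> RoutedResponse:
    """Answer locally if the predicted judge score clears the threshold,
    otherwise defer to the larger model."""
    predicted = predict_score(query)  # prediction happens before answering
    if predicted >= threshold:
        return RoutedResponse(small_answer(query), "small", predicted)
    return RoutedResponse(large_answer(query), "large", predicted)


# Toy stand-ins for real models (hypothetical, for demonstration only).
def toy_predict_score(query: str) -> float:
    # Pretend short queries are easy for the small model.
    return 9.0 if len(query.split()) <= 5 else 3.0


def toy_small_answer(query: str) -> str:
    return f"[small model] answer to: {query}"


def toy_large_answer(query: str) -> str:
    return f"[large model] answer to: {query}"


easy = route("What is 2 + 2?", toy_predict_score,
             toy_small_answer, toy_large_answer)
hard = route("Prove that every bounded monotone sequence of reals converges.",
             toy_predict_score, toy_small_answer, toy_large_answer)
print(easy.answered_by, hard.answered_by)
```

In this sketch the threshold controls the efficiency/quality trade-off: raising it sends more queries to the larger model, while lowering it keeps more queries on the device.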
