Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures
Linhai Zhang, Yulan He

TL;DR
This paper studies test-time personalization for large language models: it samples multiple candidates from a personalized policy model, selects the best one with a personalized reward model, diagnoses why standard reward models fail to scale, and proposes a probabilistic reward model as a fix.
Contribution
It provides a theoretical analysis of test-time scaling, identifies failure modes of reward models, and proposes a probabilistic reward model to mitigate these issues.
Findings
Expected utility grows logarithmically with sample size under oracle selection.
Standard reward models often fail because of user-level collapse and query-level reward hacking.
The proposed probabilistic reward model mitigates these failure modes and improves scaling.
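The logarithmic growth in the first finding can be illustrated with a small simulation. This is only a sketch under a loudly stated assumption: the listing does not say which utility distribution the paper's proof uses, so the code assumes candidate utilities are i.i.d. Exponential(1), a case where the expected maximum of N draws is exactly the harmonic number H_N, which grows like ln N. The function names `harmonic` and `simulate_oracle_curve` are hypothetical and do not come from the paper.

```python
import random
from typing import Dict, Sequence


def harmonic(n: int) -> float:
    """H_n = 1 + 1/2 + ... + 1/n, the exact E[max] of n i.i.d. Exp(1) draws."""
    return sum(1.0 / k for k in range(1, n + 1))


def simulate_oracle_curve(
    ns: Sequence[int], trials: int = 2000, seed: int = 0
) -> Dict[int, float]:
    """Monte Carlo estimate of the oracle Best-of-N utility for each N.

    ASSUMPTION (not from the paper): each candidate's true utility is an
    independent Exponential(1) draw, and the oracle always picks the maximum.
    """
    rng = random.Random(seed)
    curve = {}
    for n in ns:
        total = 0.0
        for _ in range(trials):
            # Oracle selection: take the best true utility among n candidates.
            total += max(rng.expovariate(1.0) for _ in range(n))
        curve[n] = total / trials
    return curve


if __name__ == "__main__":
    for n, value in simulate_oracle_curve([1, 4, 16, 64]).items():
        print(f"N={n:3d}  simulated={value:.3f}  H_N={harmonic(n):.3f}")
```

Under this assumption each fourfold increase in N adds roughly ln 4 to the expected utility, which is the diminishing-returns shape the finding describes. Other utility distributions would give different constants and possibly a different growth rate.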
Abstract
Existing approaches to LLM personalization focus on constructing better personalized models or inputs, while treating inference as a single-shot process. In this work, we study Test-Time Personalization (TTP) along an unexplored axis: scaling inference-time computation by sampling N candidates from a personalized policy model and selecting the best with a personalized reward model. We prove that oracle selection yields expected utility growing logarithmically with the number of sampled candidates, establishing a theoretical ceiling for test-time scaling. However, standard reward models fail to realize this potential. To diagnose why, we derive a unified scaling law that decomposes any reward model's Best-of-N curve into four measurable quantities and reveals two failure modes, user-level collapse (near-constant prediction for some users) and query-level reward hacking (negative…
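The sample-then-select procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_policy` and `toy_reward` are hypothetical stand-ins for the personalized policy model and personalized reward model, and the "user prefers short answers" rule is an assumption made purely so the example runs.

```python
import random
from typing import Callable

Policy = Callable[[str, random.Random], str]
Reward = Callable[[str, str], float]


def best_of_n(query: str, policy: Policy, reward: Reward, n: int, seed: int = 0) -> str:
    """Sample n candidates from the policy and return the highest-reward one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [policy(query, rng) for _ in range(n)]
    return max(candidates, key=lambda c: reward(query, c))


# Hypothetical stand-ins, for illustration only.
STYLES = [
    "Short answer.",
    "A medium-length answer with some detail.",
    "A very long answer that walks through every step in full detail.",
]


def toy_policy(query: str, rng: random.Random) -> str:
    """Pretend policy: draws one of a few fixed response styles at random."""
    return rng.choice(STYLES)


def toy_reward(query: str, response: str) -> float:
    """Pretend personalized reward: this user is assumed to prefer brevity."""
    return -float(len(response))


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, "->", best_of_n("How do I reset my password?", toy_policy, toy_reward, n))
```

Because the same seed produces the same first candidates, the selected response's reward never decreases as n grows in this sketch. With a learned reward model that guarantee applies only to the predicted reward, not to the user's true utility, which is the gap the paper's failure-mode analysis addresses.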
