Permutation-Consensus Listwise Judging for Robust Factuality Evaluation
Tianyi Huang, Nathan Huang, Justin Tang, Wenqian Chen, Elsa Fan

TL;DR
This paper introduces PCFJudge, an inference-time method that reruns a listwise factuality evaluation over multiple orderings of the same candidates and aggregates the results into a consensus decision, improving top-1 selection accuracy on RewardBench 2 Factuality for both judge models tested.
Contribution
The paper proposes a novel inference-time approach that reduces candidate-order sensitivity in LLM factuality evaluation by aggregating multiple orderings.
Findings
On RewardBench 2 Factuality with seven permutations (K=7), top-1 selection accuracy improved from 86.00% to 91.33% with GPT-5.4.
Under the same setup, top-1 selection accuracy improved from 86.33% to 89.67% with Claude Sonnet 4.6.
Candidate order can be a meaningful source of error in factuality judging.
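To put the accuracy figures above in terms of errors removed, the short calculation below converts each before/after pair into an absolute gain and a relative error reduction. The function name `relative_error_reduction` is illustrative, and the derived percentages are simple arithmetic on the reported numbers, not figures stated by the paper.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of the original top-1 errors that are eliminated.

    Accuracies are given in percent, e.g. 86.00 for 86.00%.
    """
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before


# Reported accuracies (percent): single listwise pass vs. K=7 consensus.
reported = {
    "GPT-5.4": (86.00, 91.33),
    "Claude Sonnet 4.6": (86.33, 89.67),
}

for model, (before, after) in reported.items():
    gain = after - before
    rer = relative_error_reduction(before, after)
    print(f"{model}: +{gain:.2f} points, {rer:.1%} fewer top-1 errors")
```

Under this reading, the GPT-5.4 gain of 5.33 points removes roughly 38% of the original top-1 errors, and the Claude Sonnet 4.6 gain of 3.34 points removes roughly 24%.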
Abstract
Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing substantially in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, the final seven-permutation aggregate (K=7) improves top-1 selection accuracy from 86.00% to 91.33% with GPT-5.4 and from 86.33% to 89.67% with Claude Sonnet 4.6. These results suggest that candidate order can be a meaningful source of factuality-judging error and…
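The procedure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Judge` interface, the use of cyclic rotations as the K orderings, and the aggregation rule (mean score, with mean rank as a tie-breaker) are all hypothetical choices, and the uncertainty signals the paper mentions are omitted. The toy judge stands in for an LLM call and adds a fixed bonus to whichever candidate is shown first.

```python
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical judge interface: receives candidates in presentation order and
# returns one factuality score per position (higher means more factual).
Judge = Callable[[Sequence[str]], List[float]]


def consensus_pick(candidates: Sequence[str], judge: Judge, k: int = 7) -> int:
    """Return the index of the candidate selected by a K-ordering consensus."""
    n = len(candidates)
    scores: List[List[float]] = [[] for _ in range(n)]
    ranks: List[List[int]] = [[] for _ in range(n)]
    for r in range(k):
        # Assumption: cyclic rotations as the orderings, so first-slot
        # exposure is spread roughly evenly across candidates.
        order = [(r + j) % n for j in range(n)]
        out = judge([candidates[i] for i in order])
        # Map position-level scores and ranks back to the original indices.
        by_score = sorted(range(n), key=lambda pos: -out[pos])
        for rank, pos in enumerate(by_score, start=1):
            ranks[order[pos]].append(rank)
        for pos, idx in enumerate(order):
            scores[idx].append(out[pos])
    # Assumed aggregation: highest mean score, ties broken by lowest mean rank.
    return max(range(n), key=lambda i: (mean(scores[i]), -mean(ranks[i])))


def toy_judge(ordered: Sequence[str]) -> List[float]:
    """Stand-in for an LLM judge with a bias toward the first slot."""
    quality = {"A": 0.80, "B": 0.75, "C": 0.40, "D": 0.30}
    return [quality[c] + (0.10 if pos == 0 else 0.0) for pos, c in enumerate(ordered)]


candidates = ["B", "A", "C", "D"]  # "A" is the most factual answer
single = toy_judge(candidates)
single_pick = max(range(len(candidates)), key=lambda i: single[i])
consensus = consensus_pick(candidates, toy_judge, k=7)
print(candidates[single_pick], candidates[consensus])
```

In this toy setup a single pass selects "B" because it happens to be listed first, while the seven-ordering consensus selects "A", since the position bonus is averaged over candidates instead of landing on one of them.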
