Permutation-Consensus Listwise Judging for Robust Factuality Evaluation

Tianyi Huang; Nathan Huang; Justin Tang; Wenqian Chen; Elsa Fan

arXiv:2603.20562 · cs.CL · May 19, 2026

TL;DR

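This paper introduces PCFJudge, an inference-time method that reruns the same listwise factuality prompt over multiple orderings of the candidate set and aggregates the results into one consensus decision. On RewardBench 2 Factuality, it improves top-1 selection accuracy by 5.33 percentage points with GPT-5.4 and by 3.34 points with Claude Sonnet 4.6.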

Contribution

The paper proposes an inference-time approach that reduces candidate-order sensitivity in listwise LLM factuality evaluation by aggregating judgments over multiple orderings of the same candidate set.

Findings

01

With GPT-5.4, the seven-permutation aggregate (K=7) on RewardBench 2 Factuality improves top-1 selection accuracy from 86.00% to 91.33%.

02

With Claude Sonnet 4.6, the K=7 aggregate on RewardBench 2 Factuality improves top-1 selection accuracy from 86.33% to 89.67%.

03

The results suggest that candidate order can be a meaningful source of error in listwise factuality judging.

Abstract

Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing substantially in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, the final seven-permutation aggregate (K=7) improves top-1 selection accuracy from 86.00% to 91.33% with GPT-5.4 and from 86.33% to 89.67% with Claude Sonnet 4.6. These results suggest that candidate order can be a meaningful source of factuality-judging error and…
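
The rerun-and-aggregate idea described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `Judge` callable interface, the helper `permutation_consensus`, the toy judge, and the aggregation rule itself (mean score across orderings, with mean rank as a tie-breaker). The abstract states that scores, ranks, and uncertainty signals are aggregated but does not specify how, and this sketch leaves uncertainty signals out entirely.

```python
import itertools
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical judge interface (an assumption, not the paper's API): given a
# prompt and the candidates in display order, return one factuality score per
# displayed candidate, where higher means more factual.
Judge = Callable[[str, Sequence[str]], List[float]]


def permutation_consensus(
    prompt: str,
    candidates: Sequence[str],
    judge: Judge,
    k: int = 7,
    seed: int = 0,
) -> Tuple[int, List[float], List[float]]:
    """Judge the same candidates under up to k distinct orderings and aggregate.

    Returns (winner_index, mean_scores, mean_ranks), all indexed by the
    original candidate positions. The aggregation rule is an illustrative
    assumption: highest mean score wins, lower mean rank breaks ties.
    """
    n = len(candidates)
    # Enumerating all orderings is only practical for short candidate lists.
    all_orders = list(itertools.permutations(range(n)))
    rng = random.Random(seed)
    orders = rng.sample(all_orders, min(k, len(all_orders)))

    scores: List[List[float]] = [[] for _ in range(n)]
    ranks: List[List[int]] = [[] for _ in range(n)]
    for order in orders:
        shown = [candidates[i] for i in order]
        out = judge(prompt, shown)
        # Map each displayed position back to its original candidate index.
        for pos, score in enumerate(out):
            scores[order[pos]].append(score)
        by_score = sorted(range(n), key=lambda pos: out[pos], reverse=True)
        for rank, pos in enumerate(by_score, start=1):
            ranks[order[pos]].append(rank)

    mean_scores = [mean(s) for s in scores]
    mean_ranks = [mean(r) for r in ranks]
    winner = max(range(n), key=lambda i: (mean_scores[i], -mean_ranks[i]))
    return winner, mean_scores, mean_ranks


def toy_biased_judge(prompt: str, shown: Sequence[str]) -> List[float]:
    """Toy stand-in for an LLM judge with a made-up first-position bonus."""
    quality = {"A": 0.80, "B": 0.75, "C": 0.30}
    return [quality[c] + (0.10 if pos == 0 else 0.0) for pos, c in enumerate(shown)]
```

With the toy judge, a single pass over the ordering `["B", "A", "C"]` scores B above A because of the position bonus, while `permutation_consensus("q", ["A", "B", "C"], toy_biased_judge, k=6)` covers all six orderings, so the bonus is spread evenly and A is selected. Sampling distinct orderings rather than shuffling independently is a design choice of this sketch; it avoids spending judge calls on repeated orderings when the candidate list is short.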
