Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

Zhijun Chen; Zeyu Ji; Qianren Mao; Hao Wu; Jinhuan Song; Junhang Cheng; Bangjie Qin; Zhuoran Li; Jingzheng Li; Kai Sun; Zizhe Wang; Yikun Ban; Zhu Sun; Xiangyang Ji; Hailong Sun

arXiv:2512.23213·cs.CL·April 28, 2026

TL;DR

This paper introduces LLM-PeerReview, an unsupervised ensemble method that selects the best response from multiple large language models using a peer-review-inspired, transparent framework, improving performance across various tasks.

Contribution

It presents a fully unsupervised, peer-review-inspired ensemble approach that selects the best of multiple LLM responses through a transparent, interpretable mechanism and adapts flexibly across tasks.

Findings

01. Outperforms the advanced model Smoothie-Global by 6.9 and 7.3 percentage points.

02. Works effectively across factual recall, math reasoning, and instruction-following tasks.

03. Uses LLM-as-a-Judge scoring and graphical model-based truth inference for response selection.

Abstract

We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the most ideal response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a transparent and interpretable mechanism, while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: For scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing multiple LLMs at hand; For reasoning, we can apply a straightforward averaging strategy or a principled graphical model-based truth inference algorithm to aggregate multiple scores to produce a final score for each response; Finally, the highest-scoring response is selected as the best ensemble output. LLM-PeerReview is…
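The three stages described in the abstract can be illustrated with a minimal Python sketch of the simple averaging variant. It assumes the LLM-as-a-Judge scores have already been collected as one list of numbers per judge. The function name `select_best_response`, the score scale, and the example data are all hypothetical and do not come from the paper. The graphical model-based truth inference variant is not shown.

```python
from statistics import mean


def select_best_response(responses, judge_scores):
    """Pick one response by averaging peer scores (illustrative sketch only).

    responses:    list of candidate answers, one per LLM.
    judge_scores: list of rows; judge_scores[j][i] is the score that
                  judge j assigned to responses[i] (assumed precomputed).
    Returns (best_index, final_scores).
    """
    if not responses or not judge_scores:
        raise ValueError("need at least one response and one judge")
    for row in judge_scores:
        if len(row) != len(responses):
            raise ValueError("each judge must score every response")

    # Reasoning stage (averaging variant): aggregate the judges' scores.
    final_scores = [
        mean([row[i] for row in judge_scores]) for i in range(len(responses))
    ]

    # Selection stage: the highest final score wins; ties go to the
    # earliest candidate because max() returns the first maximum.
    best_index = max(range(len(responses)), key=lambda i: final_scores[i])
    return best_index, final_scores


# Hypothetical example: three candidates scored by three judges.
responses = ["answer A", "answer B", "answer C"]
judge_scores = [
    [3, 5, 2],  # judge 1
    [4, 4, 1],  # judge 2
    [2, 5, 3],  # judge 3
]
best_index, final_scores = select_best_response(responses, judge_scores)
print(responses[best_index], final_scores)
```

Plain averaging weights every judge equally. The truth inference variant mentioned in the abstract is described as a more principled way to aggregate the same scores.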
