Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

Wataru Hirota; Tomoki Taniguchi; Tomoko Ohkuma; Kosuke Takahashi; Takahiro Omi; Kosuke Arima; Takuto Asakura; Chung-Chi Chen; Tatsuya Ishigaki

arXiv:2604.22517·cs.CL·April 27, 2026

TL;DR

This study compares aggregate and personalized judges for evaluating business ideas on which experts disagree. It finds that personalized judges match individual evaluators' judgments more closely than aggregate judges do, and it suggests new evaluation methods.

Contribution

The paper introduces PBIG-DATA, a dataset of approximately 3,000 individual expert scores across 300 patent-grounded product ideas, and shows that personalized judges align with individual evaluators' judgments more closely than aggregate models do.

Findings

01

Personalized judges align more closely with individual evaluator scores.

02

Experts disagree substantially on fine-grained ordinal scores but agree more under coarse selection.

03

Evaluator agreement correlates with similarity of reasoning only in personalized models.
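
A minimal sketch of why an aggregate target may limit alignment with individual evaluators. It is not the paper's method. The toy scores, the 1-5 scale, the evaluator names, and the use of mean absolute error as the alignment measure are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical toy scores on an assumed 1-5 scale (one list per evaluator,
# one entry per idea). These are NOT taken from PBIG-DATA.
scores = {
    "evaluator_a": [5, 4, 2, 1],
    "evaluator_b": [3, 4, 4, 1],
    "evaluator_c": [1, 4, 3, 1],
}

def consensus(scores):
    """Mean score per idea across all evaluators (the aggregate target)."""
    return [mean(column) for column in zip(*scores.values())]

def mae(pred, target):
    """Mean absolute error between two equal-length score lists."""
    return mean(abs(p - t) for p, t in zip(pred, target))

# Even a judge that reproduced the consensus perfectly would still miss
# each individual evaluator by the gap computed here.
agg = consensus(scores)
gap = {name: mae(agg, own) for name, own in scores.items()}

for name, value in gap.items():
    print(f"{name}: MAE of consensus vs. own scores = {value:.2f}")
```

In this toy setting the gap is zero only on ideas where all evaluators agree. A personalized judge would instead be trained and scored against one evaluator's row at a time.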

Abstract

Evaluating LLM-generated business ideas is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree. This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually? We introduce PBIG-DATA, a dataset of approximately 3,000 individual scores across 300 patent-grounded product ideas, provided by domain experts on six business-oriented dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, and market size. Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, suggesting structured…
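
The contrast between fine-grained and coarse agreement can be illustrated with a small sketch. The two experts' scores, the 1-5 scale, and the selection threshold of 4 are hypothetical, and raw match rate stands in for whatever agreement statistic the paper actually reports.

```python
def agreement_rate(a, b):
    """Fraction of items on which two label lists match exactly."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def to_selection(scores, threshold=4):
    """Coarsen ordinal scores into select / not-select.

    The threshold is a hypothetical choice for this example.
    """
    return [s >= threshold for s in scores]

# Hypothetical toy scores from two experts on six ideas (assumed 1-5 scale).
expert_1 = [5, 4, 2, 1, 3, 4]
expert_2 = [4, 5, 1, 2, 3, 5]

fine = agreement_rate(expert_1, expert_2)
coarse = agreement_rate(to_selection(expert_1), to_selection(expert_2))

print(f"exact-score agreement: {fine:.2f}")
print(f"selection agreement:   {coarse:.2f}")
```

Here the experts rarely give the same ordinal score yet select the same ideas. Raw match rate is not chance-corrected, so a real analysis would likely prefer a chance-corrected statistic.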
