Predicting Disagreement with Human Raters in LLM-as-a-Judge Difficulty Assessment without Using Generation-Time Probability Signals

Yo Ehara

arXiv:2605.12422·cs.CL·May 13, 2026


TL;DR

This paper proposes a method for predicting which LLM-generated difficulty ratings will disagree with human raters. Instead of generation-time probability signals, it checks the geometric consistency of the ratings in a separate embedding space, and it reports higher AUC than probability-based baselines.

Contribution

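The proposed approach flags ratings that are likely to disagree with human raters without relying on generation-time probability signals. It exploits the ordinal nature of difficulty and scores the geometric consistency of the rating set in a separate embedding space, such as ModernBERT.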

Findings

1. Achieves higher AUC than probability-based baselines when predicting disagreement with human raters.

2. Effective on English sentence difficulty assessment datasets based on CEFR levels.

3. Applicable to ratings from multiple LLMs, including GPT-OSS-120B and Qwen3-235B-A22B.
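
For readers unfamiliar with the metric in the first finding, the following is a minimal sketch of how AUC for disagreement prediction could be computed. It is not taken from the paper: the function name `roc_auc`, the toy ratings, and the toy scores are all illustrative assumptions. An item is treated as a positive when the LLM rating differs from the human rating, and AUC is the probability that a randomly chosen disagreeing item receives a higher score than a randomly chosen agreeing one.

```python
import numpy as np

def roc_auc(scores, labels):
    """Pairwise (Mann-Whitney) AUC; tied scores count as half a win."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]    # scores of items where LLM and human disagree
    neg = scores[~labels]   # scores of items where they agree
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one item of each class")
    diff = pos[:, None] - neg[None, :]   # every positive vs. every negative
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

# Hypothetical toy data: ordinal difficulty levels encoded as integers.
llm_ratings = np.array([1, 2, 3, 4, 5, 2])
human_ratings = np.array([1, 2, 5, 4, 3, 2])
disagree = llm_ratings != human_ratings

# Hypothetical disagreement scores from any detector (higher = more suspect).
scores = np.array([0.1, 0.2, 0.9, 0.3, 0.8, 0.4])
auc = roc_auc(scores, disagree)
print(auc)
```

Because AUC depends only on how the scores rank the items, it can compare detectors whose scores live on different scales, which is relevant when comparing against probability-based baselines.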

Abstract

Automatic generation of educational materials using large language models (LLMs) is becoming increasingly common, but assigning difficulty levels to such materials still requires substantial human effort. LLM-as-a-Judge has therefore attracted attention, yet disagreement with human raters remains a major challenge. We propose a method for predicting which LLM-generated difficulty ratings are likely to disagree with human raters, so that such cases can be sent for re-rating. Unlike prior approaches, our method does not rely on generation-time probability signals, which must be collected during rating generation and are often difficult to compare across LLMs. Instead, exploiting the fact that difficulty is an ordinal scale, we use a separate embedding space, such as ModernBERT, and identify disagreement candidates based on the geometric consistency of the rating set. Experiments on…
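
The abstract does not spell out how geometric consistency is measured, so the following is only a sketch of one plausible reading, not the paper's method. It assumes that each rated sentence already has an embedding from a separate model (the toy 2-D vectors below stand in for something like ModernBERT outputs), that ordinal levels are encoded as integers, and that an item is suspicious when its rating differs strongly from the ratings of its nearest neighbors. The function `neighbor_inconsistency` and the parameter `k` are hypothetical.

```python
import numpy as np

def neighbor_inconsistency(embeddings, ratings, k=2):
    """Mean absolute rating gap between each item and its k nearest
    neighbors by cosine similarity. An assumed scoring rule."""
    X = np.asarray(embeddings, dtype=float)
    r = np.asarray(ratings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sim = X @ X.T                                     # cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-matches
    nn = np.argsort(-sim, axis=1)[:, :k]              # k most similar items
    # Ordinal levels make the absolute difference a meaningful distance.
    return np.abs(r[nn] - r[:, None]).mean(axis=1)

# Toy stand-in embeddings: two well-separated clusters of sentences.
emb = np.array([
    [1.0, 0.0], [1.0, 0.1], [1.0, 0.2], [1.0, 0.15],   # cluster A
    [0.0, 1.0], [0.1, 1.0], [0.2, 1.0], [0.15, 1.0],   # cluster B
])
# Hypothetical LLM ratings: item 3 sits in cluster A but is rated like B.
llm_ratings = np.array([1, 1, 1, 5, 5, 5, 5, 5])

scores = neighbor_inconsistency(emb, llm_ratings, k=2)
suspect = int(np.argmax(scores))   # index to send for re-rating first
print(scores, suspect)
```

A score of this kind needs only the final ratings and an external embedding model, so it can be computed after the fact and compared across LLMs, in line with the motivation given in the abstract.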
