What Is Missing: Interpretable Ratings for Large Language Model Outputs
Nicholas Stranges, Yimin Yang

TL;DR
The paper introduces the WIM rating system, which uses natural language feedback to produce interpretable and more effective preference ratings for LLM outputs, improving training signals and enabling qualitative debugging.
Contribution
WIM provides a novel, interpretable rating method using natural language feedback that integrates seamlessly into existing preference learning pipelines.
Findings
WIM yields fewer ties and larger rating differences than numerical ratings.
WIM improves the availability of learning signals in preference data.
WIM enables qualitative debugging through interpretability of feedback.
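The first finding compares rating schemes by how often two candidate outputs tie and by how far apart their ratings are. As a rough sketch of how such a comparison could be measured (the function names, the tolerance parameter, and the example numbers below are all hypothetical and not taken from the paper), one might compute a tie rate and a mean absolute rating gap over pairs of ratings:

```python
from statistics import mean


def tie_rate(pairs, tol=0.0):
    """Fraction of (rating_a, rating_b) pairs whose ratings differ by at most tol.

    Assumption: a "tie" means the two ratings are equal within tol; the
    paper's exact definition may differ.
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    ties = sum(1 for a, b in pairs if abs(a - b) <= tol)
    return ties / len(pairs)


def mean_abs_gap(pairs):
    """Average absolute difference between the two ratings in each pair."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    return mean(abs(a - b) for a, b in pairs)


# Hypothetical integer ratings for four pairs of outputs (illustration only).
example_pairs = [(7, 7), (8, 7), (6, 6), (9, 8)]
example_tie_rate = tie_rate(example_pairs)
example_gap = mean_abs_gap(example_pairs)
```

A tied pair carries no preference signal for a pairwise learner, which is why a lower tie rate corresponds to more usable training pairs.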
Abstract
Current Large Language Model (LLM) preference learning methods such as Proximal Policy Optimization and Direct Preference Optimization learn from direct rankings or numerical ratings of model outputs. These rankings are subjective, and a single numerical rating chosen directly by a judge is a poor proxy for the quality of natural language. We introduce the What Is Missing (WIM) rating system to produce rankings from natural-language feedback. WIM integrates into existing training pipelines, can be combined with other rating techniques, and can be used as input to any preference learning method without changing the learning algorithm. To compute a WIM rating, a human or LLM judge writes feedback describing what the model output is missing. We embed the output and the feedback with a sentence embedding model and compute the cosine similarity between the resulting vectors. We empirically…
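The embedding-and-similarity step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real sentence embedding model, and the function names are invented for this example. The excerpt does not say how the similarity is turned into the final rating, so the sketch stops at the cosine similarity.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a sentence embedding model.

    Returns a sparse bag-of-words count vector. A real pipeline would
    call a learned sentence embedding model here instead.
    """
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty text
    return dot / (norm_u * norm_v)


def output_feedback_similarity(output, feedback):
    """Embed the model output and the judge's feedback, then compare them."""
    return cosine_similarity(toy_embed(output), toy_embed(feedback))


# Hypothetical output and "what is missing" feedback (illustration only).
similarity = output_feedback_similarity(
    "The capital of France is Paris.",
    "The answer is missing the population of Paris.",
)
```

Because the feedback is ordinary text, the same string that feeds the similarity computation can also be read by a person, which is the interpretability property the summary highlights.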
Peer Reviews
Decision · Submitted to ICLR 2026
The strength is the novelty of the idea that tries to set up an adversary-like LLM.
Overall, I feel that the paper has a nice idea but the evidence is on the weak side and the evaluation is far from comprehensive. - There is a frequent use of passive voice, which makes it difficult to immediately catch if the subject is the authors or existing literature or someone else. There are excellent web articles about why passive voice can make the writing less clear. Please consider switching to active voice to improve readability. - Figure 3 does not provide a fair comparison. The pr
- The paper is well motivated, in that an automatic method to determine preferences would be beneficial.
- The premise of the paper is counter-intuitive. The proposed metric seems to be limited to instances where preference can be determined through key missing information. However, in those cases the human-provided preferences should be neither ambiguous nor subjective. In contrast, in cases where there is low human agreement, e.g. preference over stylistic choices in language, there would be no missing information for this metric to capture. - The qualitative analysis is vague and based on undisclose
S1. The framework proposed is flexible enough to be adapted to a plethora of preference optimization (PO) methods, with the possibility to use either human or LLM-based judges. S2. The method tackles the lack of interpretability and expressiveness in preference ratings, a well-known limitation in PO methods.
The paper mainly lacks rigorous experiments to support the claimed benefits of the model, such as how it compares to strong, simpler baselines (PPO, DPO) in well-established experimental setups for preference optimization (e.g. HH, TL;DR, AlpacaEval2, among many others). As such, it is difficult to draw grounded conclusions about the contributions of this paper.
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques
