References Improve LLM Alignment in Non-Verifiable Domains
Kejian Shi, Yixin Liu, Peifeng Wang, Alexander R. Fabbri, Shafiq Joty, Arman Cohan

TL;DR
This paper demonstrates that reference-guided LLM evaluators significantly improve alignment and self-improvement in non-verifiable domains, achieving performance comparable to reward models through a novel evaluation protocol.
Contribution
It introduces a reference-guided evaluation approach that enhances LLM judges and enables effective self-improvement in non-verifiable domains, where RLVR cannot be applied directly because no ground-truth verifier exists.
Findings
Reference-guided evaluators improve judgment accuracy.
Enhanced judges lead to better LLM alignment performance.
The method achieves results comparable to strong reward models.
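To make "judgment accuracy" in the findings above concrete, one common reading is agreement with human preference labels on pairwise comparisons. The sketch below assumes that reading; the function name and the toy labels are hypothetical and are not results from the paper.

```python
from typing import Sequence


def judge_accuracy(judge_choices: Sequence[str], human_choices: Sequence[str]) -> float:
    """Fraction of pairwise comparisons where the judge picks the human-preferred output."""
    if len(judge_choices) != len(human_choices):
        raise ValueError("judge and human label lists must have the same length")
    if not human_choices:
        raise ValueError("at least one comparison is required")
    matches = sum(j == h for j, h in zip(judge_choices, human_choices))
    return matches / len(human_choices)


# Toy labels for illustration only (assumption, not data from the paper).
human = ["a", "b", "a", "a"]
ref_free_judge = ["a", "a", "b", "a"]
ref_guided_judge = ["a", "b", "b", "a"]

print(judge_accuracy(ref_free_judge, human))
print(judge_accuracy(ref_guided_judge, human))
```

Comparing the two numbers on the same set of human-labeled pairs is what lets a reference-free and a reference-guided judge be ranked against each other.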
Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft "verifiers". First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to…
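As a rough illustration of what a reference-guided LLM evaluator might look like in practice, the sketch below builds a pairwise judging prompt that optionally includes a reference output and parses the judge's verdict. The prompt wording, the function names, and the `llm` callable are all assumptions made for this example; they are not the paper's actual prompts or code.

```python
from typing import Callable, Optional


def build_judge_prompt(
    instruction: str,
    output_a: str,
    output_b: str,
    reference: Optional[str] = None,
) -> str:
    """Assemble a pairwise judging prompt; the reference section is included only if given."""
    parts = [
        "You are comparing two responses to the same instruction.",
        f"Instruction:\n{instruction}",
    ]
    if reference is not None:
        # Hypothetical wording: the reference acts as a soft guide, not an exact-match target.
        parts.append(f"Reference output (a guide to a high-quality response):\n{reference}")
    parts.append(f"Output (a):\n{output_a}")
    parts.append(f"Output (b):\n{output_b}")
    parts.append("Which output follows the instruction better? Answer with 'a' or 'b'.")
    return "\n\n".join(parts)


def parse_verdict(raw: str) -> Optional[str]:
    """Return 'a' or 'b', or None if the judge's reply cannot be read as a verdict."""
    token = raw.strip().lower().strip("()'\". ")
    return token if token in {"a", "b"} else None


def judge_pair(
    llm: Callable[[str], str],
    instruction: str,
    output_a: str,
    output_b: str,
    reference: Optional[str] = None,
) -> Optional[str]:
    """Run one pairwise comparison with any text-in, text-out model callable."""
    return parse_verdict(llm(build_judge_prompt(instruction, output_a, output_b, reference)))


# Stub model standing in for a real LLM call (assumption for illustration).
def stub_llm(prompt: str) -> str:
    return "(a)"


print(judge_pair(stub_llm, "Summarize the text.", "Short summary.", "Off-topic reply.",
                 reference="A faithful summary."))
```

Passing `reference=None` gives the reference-free variant of the same prompt, so the two settings differ only in whether the reference section is present.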
Peer Reviews
Decision · ICLR 2026 Poster
* Evaluation across diverse datasets (Natural, Adversarial, MT-Bench, InstruSum, HREF, AlpacaEval and ArenaHard) and models (Qwen, Llama, GPT, Gemma, GLM, etc.).
* Useful settings and approaches. RefEval provides explicit guidance on using reference outputs as benchmarks for instruction-following quality. RefMatch provides a good baseline for comparison.
My concerns are mostly about the experimental design.
* Missing comparison to recent reference-based methods such as RevisEval, beyond a brief mention.
* No systematic study of how reference quality affects downstream performance.
* Unclear how to obtain high-quality references in domains where frontier models struggle.
* No analysis of how reference diversity affects evaluation robustness, especially for aspects that the reference responses miss.
The paper asks a clear, practical question—can we ground LLM judges and training with strong references—and answers it with simple, reproducible tooling (Ref-Free, RefEval, RefMatch) and a clean SFT→DPO pipeline. The experimental setup is broad (many judges, several datasets) and the gains are consistent: reference-guided judging improves agreement/utility, and reference-distilled SFT followed by preference optimization moves mid-size models meaningfully on common leaderboards. I also like that
Methodologically, the novelty feels incremental: reference-guided evaluation and reference-distilled training have both appeared before, and the paper mostly scales and systematizes them rather than introducing a new objective or learning signal. The comparisons also underplay strong preference-optimization baselines (e.g., SimPO, ORPO, KTO) and fine-grained supervision relevant to credit assignment (token/segment-level DPO; span-supervised MT like TWA). Robustness is not fully stress-tested:
1. The proposed approach is a step towards something similar to RLVR for non-verifiable domains, which makes this work interesting to the broad community.
2. Comparisons with strong baselines from the literature are provided. In addition, the authors design a strong reference-free baseline that is directly comparable to their reference-based approach in terms of prompt quality.
3. Experiments with sources from different frontier LLMs are conducted so that the results are not specific to a particular
1. The baselines in Table 4 are not strong enough. Specifically, some of the baselines in Table 4 should probably have been drawn from the baselines in Table 1. If we are to choose an evaluator approach based on the results in Table 1, we would like to know whether those results can serve as a robust tool for choosing an LLM evaluator.
Taxonomy
Topics: Topic Modeling · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)
