References Improve LLM Alignment in Non-Verifiable Domains

Kejian Shi; Yixin Liu; Peifeng Wang; Alexander R. Fabbri; Shafiq Joty; Arman Cohan

arXiv:2602.16802·cs.CL·February 20, 2026

Open Access · 3 Reviews

TL;DR

This paper demonstrates that reference-guided LLM evaluators significantly improve alignment and self-improvement in non-verifiable domains, achieving performance comparable to reward models through a novel evaluation protocol.

Contribution

It introduces a reference-guided evaluation approach that enhances LLM judges and enables effective self-improvement in non-verifiable domains, bridging the gap in alignment methods.

Findings

01. Reference-guided evaluators improve judgment accuracy.

02. Enhanced judges lead to better LLM alignment performance.

03. Method achieves comparable results to strong reward models.

Abstract

While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft "verifiers". First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to…
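To make the idea of a reference-guided judge concrete, here is a minimal sketch of a pairwise evaluation in which the judge sees a reference output next to the two candidates, and of how its agreement with human labels could be scored. The prompt wording, the `PairwiseExample` fields, and the function names are all hypothetical illustrations, not the protocols or prompts used in the paper; `judge` stands in for any call to an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class PairwiseExample:
    """One human-labeled comparison (hypothetical schema, not from the paper)."""
    instruction: str
    output_a: str
    output_b: str
    reference: str     # e.g. a frontier-model or human-written response
    human_label: str   # "A" or "B"


def build_reference_guided_prompt(ex: PairwiseExample) -> str:
    """Assemble a judge prompt that shows the reference alongside both candidates.

    The wording is an assumption made for illustration only.
    """
    return (
        "Compare two responses to the instruction below.\n"
        "Use the reference response as a guide to what a good answer looks like.\n\n"
        f"Instruction:\n{ex.instruction}\n\n"
        f"Reference response:\n{ex.reference}\n\n"
        f"Response A:\n{ex.output_a}\n\n"
        f"Response B:\n{ex.output_b}\n\n"
        "Answer with a single letter: A or B."
    )


def parse_verdict(raw: str) -> Optional[str]:
    """Accept only a bare 'A' or 'B' (case-insensitive, optional trailing period)."""
    text = raw.strip().rstrip(".").upper()
    return text if text in {"A", "B"} else None


def judge_accuracy(examples: Iterable[PairwiseExample],
                   judge: Callable[[str], str]) -> float:
    """Fraction of examples where the judge's verdict matches the human label.

    Unparseable verdicts count as disagreements.
    """
    total = 0
    correct = 0
    for ex in examples:
        verdict = parse_verdict(judge(build_reference_guided_prompt(ex)))
        total += 1
        if verdict == ex.human_label:
            correct += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Stub judge used only to exercise the code path; a real judge would call an LLM.
    def always_a(prompt: str) -> str:
        return "A"

    data = [
        PairwiseExample("Say hi.", "Hello!", "Goodbye.", "Hi there!", "A"),
        PairwiseExample("Add 2 and 2.", "5", "4", "4", "B"),
    ]
    print(judge_accuracy(data, always_a))
```

A reference-free baseline can be sketched the same way by omitting the reference block from the prompt, which keeps the two settings comparable in everything except the presence of the reference.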

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

* Evaluation across diverse datasets (Natural, Adversarial, MTBench, InstruSum, HREF, AlpacaEval, and ArenaHard) and models (Qwen, Llama, GPT, Gemma, GLM, etc.).
* Useful settings and approaches. RefEval provides explicit guidance on using reference outputs as benchmarks for instruction-following quality. RefMatch provides a good baseline for comparison.

Weaknesses

My main concerns are with the experiment design.
* Missing comparison to recent reference-based methods like RevisEval beyond a brief mention.
* No systematic study of how reference quality affects downstream performance.
* Unclear how to obtain high-quality references in domains where frontier models struggle.
* No analysis of how reference diversity affects evaluation robustness, especially for aspects that reference responses miss.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The paper asks a clear, practical question—can we ground LLM judges and training with strong references—and answers it with simple, reproducible tooling (Ref-Free, RefEval, RefMatch) and a clean SFT→DPO pipeline. The experimental setup is broad (many judges, several datasets) and the gains are consistent: reference-guided judging improves agreement/utility, and reference-distilled SFT followed by preference optimization moves mid-size models meaningfully on common leaderboards. I also like that

Weaknesses

Methodologically, the novelty feels incremental: reference-guided evaluation and reference-distilled training have both appeared before, and the paper mostly scales and systematizes them rather than introducing a new objective or learning signal. The comparisons also underplay strong preference-optimization baselines (e.g., SimPO, ORPO, KTO) and fine-grained supervision relevant to credit assignment (token/segment-level DPO; span-supervised MT like TWA). Robustness is not fully stress-tested:

Reviewer 03 · Rating 10 · Confidence 3

Strengths

1. The proposed approach is a step towards something similar to RLVR for non-verifiable domains, which makes this work interesting to the broad community.
2. Comparisons with strong baselines from the literature are provided. In addition, the authors design a strong reference-free baseline that is directly comparable to their reference-based approach in terms of prompt quality.
3. Experiments with sources from different frontier LLMs are conducted so that the results are not specific to a particular

Weaknesses

1. The baselines in Table 4 are not as strong. Specifically, some of the baselines in Table 4 should probably have been based on the baselines in Table 1. If we are to choose an evaluator approach based on the results in Table 1, we would like to know whether those results can serve as a robust tool for choosing an LLM evaluator.

Taxonomy

Topics: Topic Modeling · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)