TL;DR
SSFO introduces a self-supervised method to improve the faithfulness of retrieval-augmented generation models, reducing supervision costs and inference burdens while achieving state-of-the-art results across multiple datasets.
Contribution
It presents the first self-supervised alignment approach for RAG faithfulness, utilizing likelihood displacement and a modified DPO loss without additional supervision.
Findings
SSFO outperforms existing methods on multiple datasets.
It improves cross-lingual faithfulness and instruction-following.
The approach reduces supervision and inference costs.
Abstract
Retrieval-Augmented Generation (RAG) systems require Large Language Models (LLMs) to generate responses that are faithful to the retrieved context. However, faithfulness hallucination remains a critical challenge, as existing methods often require costly supervision and post-training or significant inference burdens. To overcome these limitations, we introduce Self-Supervised Faithfulness Optimization (SSFO), the first self-supervised alignment approach for enhancing RAG faithfulness. SSFO constructs preference data pairs by contrasting the model's outputs generated with and without the context. Leveraging Direct Preference Optimization (DPO), SSFO aligns model faithfulness without incurring labeling costs or additional inference burden. We theoretically and empirically demonstrate that SSFO leverages a benign form of likelihood displacement, transferring probability mass from…
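The pair construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt templates, the `PreferencePair` container, the `build_pair` helper, the `generate` callable, and the default `beta` are all assumptions made for the example. The loss shown is the standard DPO objective on summed log-probabilities; the modified variant mentioned in the summary is not reproduced here because its exact form is not given in this text.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreferencePair:
    prompt: str    # context + question (assumed training prompt)
    chosen: str    # response generated with the retrieved context
    rejected: str  # response generated from the question alone


def build_pair(generate: Callable[[str], str], context: str, question: str) -> PreferencePair:
    """Build one self-supervised pair by querying the same model twice.

    `generate` is a hypothetical stand-in for any text-generation call.
    The prompt templates below are illustrative assumptions.
    """
    with_ctx = f"Context: {context}\nQuestion: {question}"
    no_ctx = f"Question: {question}"
    return PreferencePair(
        prompt=with_ctx,
        chosen=generate(with_ctx),
        rejected=generate(no_ctx),
    )


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities.

    pi_* are log-probs under the policy being trained, ref_* under the
    frozen reference model. Returns -log(sigmoid(margin)).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With a policy identical to the reference, the margin is zero and the loss equals log 2; the loss falls as the policy raises the context-grounded response relative to the no-context one.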
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- They propose a self-supervised and efficient training method. Preference pairs are self-generated, and only hundreds of examples suffice.
- They conduct a systematic evaluation with several benchmarks covering various aspects, including Robustness, Response Quality, Cross-language Response Quality, and Instruction Following Ability.
- They show that their method is effective across LLMs.
- They not only propose the effective training method but also try to explain why their method works through be
I don't see many weaknesses in this paper. However, one question is how the authors ensure that the no-context generations are indeed dispreferred responses. In some cases, no-context generations might still be factually correct and not meaningfully different from the context-grounded response. The paper could be clearer about how it guarantees that these no-context responses are dispreferred and meaningfully distinct from context-grounded responses. Moreover, they use GPT-4 as
1) The overall idea is simple yet gives a strong empirical payoff: the self-supervised preference pairs are easy to generate and avoid costly human labels, and the setup directly aligns with a context-adherence principle. This reduces supervision and avoids extra inference burden.
2) The paper explains why encouraging displacement can be beneficial when preferred examples are silver, then introduces SSFO-lambda that rescores the DPO objective so the gradient puts stronger negative weight, whi
1) Results depend on retrieval quality. The paper states a “standard RAG prompt” and datasets, but does not detail the retriever configuration or provide ablations on retrieval quality.
2) The authors report LFS, which uses GPT-4 with a provided prompt. While standard, this introduces judge bias and lacks calibration against human labels or alternative factuality metrics, and there is no reliability analysis
- The method's primary strength is its simplicity. It avoids costly human or teacher annotation, and achieving significant gains from <1000 self-generated examples is a notable practical contribution.
- The "benign likelihood displacement" concept provides confirmation that the method prioritizes contextual information over parametric memory.
- The method demonstrates empirical gains on benchmarks specifically designed to test robustness against conflicting internal knowledge (NQ-Swap, MemoTrap)
- The entire SSFO framework is built on the critical, unstated assumption that the retrieved context is always the source of truth. The method explicitly trains the model to suppress its (potentially correct) internal knowledge in favor of any provided context. This is a major blind spot. In realistic RAG scenarios involving noisy, irrelevant, or factually incorrect context, SSFO would likely amplify this critical failure mode, forcing the model to "faithfully" repeat misinformation. The paper f
1. The proposed method outperforms the baselines without requiring any external ground-truth data.
2. The experiments cover multiple model families, demonstrating the generalization ability of the method.
3. The attribution of the performance gains to the benign application of likelihood displacement is insightful, and the corresponding analysis is reasonable.
1. Self-DPO is not a new technique, and similar approaches have already been explored. For example, [1] applies self-DPO to improve truthfulness.
2. The further proposed Shift-DPO is also introduced in prior work. Additionally, the paper claims that Shift-DPO “leads to a more pronounced suppression of the likelihood of the parametric response during optimization.” However, this seems inconsistent with the stated motivation of enabling better likelihood displacement, and the connection between
