Learning from Self Critique and Refinement for Faithful LLM Summarization
Ting-Yao Hu, Hema Swetha Koppula, Hadi Pouransari, Cem Koc, Oncel Tuzel, Raviteja Vemulapalli

TL;DR
This paper introduces SCRPO, a self-supervised training method that enhances LLMs' faithfulness in summarization by leveraging their own critique and refinement capabilities, outperforming existing methods in faithfulness and efficiency.
Contribution
Proposes SCRPO, a novel self-supervised framework that uses LLMs' own critique to improve summarization faithfulness without extra test-time computation.
Findings
Outperforms state-of-the-art self-supervised methods in faithfulness metrics.
Produces more faithful summaries than test-time refinement.
Maintains or improves overall summary quality.
Abstract
Large Language Models (LLMs) often suffer from hallucinations, i.e., output content that is not grounded in the input context, when performing long-form text generation tasks such as summarization. Prior works have shown that hallucinations can be reduced by iteratively critiquing and refining previously generated outputs using either the same model or a more powerful teacher model as the critic. However, these approaches either require additional test-time compute or assume access to more powerful teacher models, making them costly and less practical. In this work, we propose Self Critique and Refinement-based Preference Optimization (SCRPO), which is a self-supervised training framework that first constructs a preference dataset by leveraging the LLM's own critique and refinement capabilities, and then applies preference learning to improve the same LLM for faithful summarization.…
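The two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`summarize`, `critique`, `refine`, `build_preference_pairs`, `dpo_loss`) are hypothetical, treating the refined summary as preferred over the initial draft is an assumption, and the DPO-style loss is only one common preference objective (the abstract says just "preference learning").

```python
import math
from typing import Callable, Dict, List


def build_preference_pairs(
    documents: List[str],
    summarize: Callable[[str], str],
    critique: Callable[[str, str], str],
    refine: Callable[[str, str, str], str],
) -> List[Dict[str, str]]:
    """Stage 1 (assumed shape): the same model drafts, critiques, and refines.

    All three callables are hypothetical wrappers around one LLM.
    """
    pairs = []
    for doc in documents:
        draft = summarize(doc)                  # initial summary
        feedback = critique(doc, draft)         # self-critique of the draft
        revised = refine(doc, draft, feedback)  # summary refined using the critique
        if revised != draft:
            # Assumption: the refined summary is "chosen", the draft "rejected".
            pairs.append({"prompt": doc, "chosen": revised, "rejected": draft})
    return pairs


def _softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Stage 2 (assumed objective): DPO loss for one preference pair.

    Inputs are summed token log-probabilities of each summary under the
    trainable policy and a frozen reference copy of the same model.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    return _softplus(-margin)  # equals -log(sigmoid(margin))
```

Because the pairs are built offline and the model is then trained on them, a setup like this would need only a single generation pass at inference time, which matches the paper's claim of avoiding extra test-time compute.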
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. Integrates self-critique and refinement into preference optimization.
2. Eliminates inference-time overhead while retaining refinement benefits.
3. Well-structured exposition, strong figures, and detailed appendices with prompts.
1. Human evaluation only contrasts SCRPO with the pretrained model, not with other baselines (e.g., SCOPE, MPO). This limits the strength of claims about outperforming state-of-the-art systems in human preference.
* The proposed method is technically sound and intuitively reasonable. The design is lightweight and clean.
* The performance gain is consistent across datasets and metrics.
* Ablation and analysis are extensive, exploring various aspects relating to the design choices of the proposed framework.
* In the ablation on model size, SCRPO decreases the faithfulness for the 0.5B and 1.5B models, and achieves only marginal improvement on the 3B model. This raises some concerns about the general applicability of SCRPO. Related to this, it would be useful to include results on larger models, as demonstrating effectiveness across mid- and large-scale models would suggest broader generalizability while restrictions to specific model size bands would indicate limited scope.
* The technical novelty
* The paper is well-motivated and features a clear, easy-to-follow flow of presentation.
* The proposed approach, SCRPO, is intuitive in the context of related work and is thoroughly discussed through various design choices and ablation studies.
* SCRPO demonstrates consistent improvements across different benchmarks in terms of both faithfulness metrics and the overall summary quality reflected by automatic evaluations.
* Although the improvements across benchmarks are consistent, the findings are not fully convincing, since:
  * the paper does not mention whether the results are stable over multiple runs or supported by significance testing;
  * the approach was only evaluated on Qwen2.5, leaving its effectiveness on other backbone LLMs unverified.
* The human evaluation remains limited, as it does not include results on CNN/DM or SAMSum or comparisons with other baselines, nor does it report inter-annotator ag
* A novel framework is proposed for improving the faithfulness of summaries in a completely self-supervised way.
* SCRPO outperforms base models and alternative techniques, even those that are not completely self-supervised.
* The effectiveness of the technique is limited by the model's self-critique ability, as demonstrated in the model size ablation study, where using SCRPO actually hurts the performance of the smaller models (0.5B, 1.5B).
* Given the point above, an experiment with a stronger teacher model (e.g., GPT or Claude) as the preference pair generator would provide a fuller picture of the evaluation. That would show how much reliance on the model's own judgements limits the performance of the main setup evaluated,
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Sentiment Analysis and Opinion Mining
