Self-Evaluation Improves Selective Generation in Large Language Models
Jie Ren, Yao Zhao, Tu Vu, Peter J. Liu, Balaji Lakshminarayanan

TL;DR
This paper introduces a self-evaluation method for large language models that reformulates open-ended generation tasks as token-level prediction tasks, so that a model can assess its own outputs more reliably and generate selectively.
Contribution
It proposes a token-level self-evaluation approach that leverages LLMs' stronger calibration at the token level to improve content quality assessment and selective generation.
Findings
Self-evaluation scores improve accuracy in content assessment.
Self-evaluation correlates better with overall content quality.
Method outperforms likelihood-based metrics in selective generation.
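Selective generation, as referred to in the findings above, can be illustrated with a minimal sketch: answer only when a confidence score clears a threshold, otherwise abstain. The function names, the threshold, and the example scores below are hypothetical and are not taken from the paper; any score (a self-evaluation score or a likelihood-based one) could be plugged in.

```python
from typing import List, Optional, Tuple


def selective_generate(
    answers: List[str], scores: List[float], threshold: float
) -> List[Optional[str]]:
    """Return each answer if its score clears the threshold, else None (abstain)."""
    return [a if s >= threshold else None for a, s in zip(answers, scores)]


def coverage_and_accuracy(
    scores: List[float], correct: List[bool], threshold: float
) -> Tuple[float, float]:
    """Coverage = fraction of questions answered; accuracy = fraction correct among answered."""
    kept = [c for s, c in zip(scores, correct) if s >= threshold]
    coverage = len(kept) / len(scores) if scores else 0.0
    accuracy = sum(kept) / len(kept) if kept else 0.0
    return coverage, accuracy


# Hypothetical scores and correctness labels, for illustration only.
answers = ["a", "b", "c", "d"]
scores = [0.9, 0.2, 0.7, 0.4]
correct = [True, False, True, False]

outputs = selective_generate(answers, scores, threshold=0.5)
coverage, accuracy = coverage_and_accuracy(scores, correct, threshold=0.5)
```

Raising the threshold trades coverage for accuracy; a better score is one that keeps accuracy high at a given coverage level.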
Abstract
Safe deployment of large language models (LLMs) may benefit from a reliable method for assessing their generated content to determine when to abstain or to selectively generate. While likelihood-based metrics such as perplexity are widely employed, recent research has demonstrated the limitations of using sequence-level probability estimates given by LLMs as reliable indicators of generation quality. Conversely, LLMs have demonstrated strong calibration at the token level, particularly when it comes to choosing correct answers in multiple-choice questions or evaluating true/false statements. In this work, we reformulate open-ended generation tasks into token-level prediction tasks, and leverage LLMs' superior calibration at the token level. We instruct an LLM to self-evaluate its answers, employing either a multi-way comparison or a point-wise evaluation approach, with the option to…
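The two self-evaluation formats named in the abstract can be sketched as token-level scoring. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the model's next-token logits for a small set of candidate tokens are already available as a dictionary, and it assumes the score is obtained by renormalizing over those candidates. The token labels ("Yes"/"No", "A"/"B"/"C") and all function names are hypothetical.

```python
import math
from typing import Dict, List


def softmax_over(logits: Dict[str, float], tokens: List[str]) -> Dict[str, float]:
    """Softmax restricted to the given candidate tokens (numerically stable)."""
    m = max(logits[t] for t in tokens)
    exps = {t: math.exp(logits[t] - m) for t in tokens}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}


def pointwise_score(logits: Dict[str, float]) -> float:
    """Point-wise evaluation: probability mass on 'Yes' versus 'No' for one candidate answer."""
    return softmax_over(logits, ["Yes", "No"])["Yes"]


def multiway_scores(logits: Dict[str, float], labels: List[str]) -> Dict[str, float]:
    """Multi-way comparison: one probability per option label, summing to 1."""
    return softmax_over(logits, labels)


# Hypothetical logits, for illustration only.
p_yes = pointwise_score({"Yes": 0.0, "No": 0.0})
options = multiway_scores({"A": 1.0, "B": 1.0, "C": 1.0}, ["A", "B", "C"])
```

In both cases the quality of a free-form answer is reduced to a distribution over a few tokens, which is the setting where the abstract says LLMs are better calibrated than at the sequence level.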
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Attention Dropout · Layer Normalization · Residual Connection · Weight Decay
