Self-Evaluation Improves Selective Generation in Large Language Models

Jie Ren; Yao Zhao; Tu Vu; Peter J. Liu; Balaji Lakshminarayanan

arXiv:2312.09300 · cs.CL · December 18, 2023 · 2 cites

TL;DR

This paper introduces a self-evaluation method for large language models that reformulates open-ended generation tasks as token-level prediction tasks, letting the models assess their own outputs and generate selectively with greater reliability.

Contribution

It proposes a token-level self-evaluation approach that exploits LLMs' strong calibration at the token level to improve both the assessment of generated content and selective generation.

Findings

01

Self-evaluation scores improve accuracy in content assessment.

02

Self-evaluation correlates better with overall content quality.

03

The method outperforms likelihood-based metrics in selective generation.
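
The selective-generation comparison in finding 03 can be illustrated with a small sketch. Assuming each generated answer has a confidence score and a correctness label, one common evaluation ranks answers by score, abstains on the lowest-ranked ones, and measures accuracy on the answers that remain; averaging over all coverage levels gives an area-under-curve summary. The helpers `selective_accuracy` and `selective_auc` and the toy data are hypothetical, not the paper's code, and the paper's exact metric definitions may differ.

```python
from typing import Sequence


def selective_accuracy(scores: Sequence[float], correct: Sequence[bool], coverage: float) -> float:
    """Accuracy on the `coverage` fraction of answers with the highest scores.

    Answers outside that fraction are treated as abstentions (hypothetical helper).
    """
    if len(scores) != len(correct) or not scores:
        raise ValueError("scores and correct must be non-empty and of equal length")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    # Rank answers from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep at least one answer so that accuracy is defined.
    kept = order[: max(1, round(coverage * len(scores)))]
    return sum(correct[i] for i in kept) / len(kept)


def selective_auc(scores: Sequence[float], correct: Sequence[bool]) -> float:
    """Average selective accuracy over coverages 1/n, 2/n, ..., 1."""
    n = len(scores)
    if n == 0:
        raise ValueError("scores must be non-empty")
    return sum(selective_accuracy(scores, correct, k / n) for k in range(1, n + 1)) / n


if __name__ == "__main__":
    # Toy data (made up): which answers were correct, and two ways of scoring them.
    correct = [True, False, True, False]
    likelihood = [0.2, 0.9, 0.6, 0.4]  # e.g. a normalized sequence likelihood
    self_eval = [0.8, 0.1, 0.7, 0.3]   # e.g. P("Yes") from a self-evaluation prompt
    print(selective_accuracy(likelihood, correct, 0.5))  # 0.5
    print(selective_accuracy(self_eval, correct, 0.5))   # 1.0
    print(selective_auc(likelihood, correct), selective_auc(self_eval, correct))
```

In this toy example the likelihood score ranks a wrong answer first, so keeping only the top half of answers leaves accuracy at 0.5, while the self-evaluation score keeps only correct answers.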

Abstract

Safe deployment of large language models (LLMs) may benefit from a reliable method for assessing their generated content to determine when to abstain or to selectively generate. While likelihood-based metrics such as perplexity are widely employed, recent research has demonstrated the limitations of using sequence-level probability estimates given by LLMs as reliable indicators of generation quality. Conversely, LLMs have demonstrated strong calibration at the token level, particularly when it comes to choosing correct answers in multiple-choice questions or evaluating true/false statements. In this work, we reformulate open-ended generation tasks into token-level prediction tasks, and leverage LLMs' superior calibration at the token level. We instruct an LLM to self-evaluate its answers, employing either a multi-way comparison or a point-wise evaluation approach, with the option to…
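
A minimal sketch of the two self-evaluation scores described above, assuming the model's next-token log-probabilities after a self-evaluation prompt are already available. For point-wise evaluation, the score is taken as the probability of a "Yes" token renormalized over {"Yes", "No"}; for multi-way comparison, candidate answers are labeled with letters and each candidate is scored by the renormalized probability of its letter. The prompt wording, token labels, renormalization step, function names, and the threshold in `select_or_abstain` are illustrative assumptions, not the paper's exact implementation.

```python
import math
from typing import Mapping, Optional, Sequence


def renormalize(logprobs: Mapping[str, float], tokens: Sequence[str]) -> dict[str, float]:
    """Softmax of next-token log-probabilities restricted to `tokens`."""
    top = max(logprobs[t] for t in tokens)  # subtract the max for numerical stability
    weights = {t: math.exp(logprobs[t] - top) for t in tokens}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


def pointwise_score(next_token_logprobs: Mapping[str, float]) -> float:
    """Point-wise self-evaluation: P("Yes") after a hypothetical prompt such as
    "Question: ... Proposed answer: ... Is the proposed answer correct? Yes or No:".
    """
    return renormalize(next_token_logprobs, ["Yes", "No"])["Yes"]


def multiway_scores(next_token_logprobs: Mapping[str, float], num_candidates: int) -> list[float]:
    """Multi-way comparison: candidates are listed as options A, B, C, ...
    and each candidate is scored by the probability of its option letter."""
    letters = [chr(ord("A") + i) for i in range(num_candidates)]
    probs = renormalize(next_token_logprobs, letters)
    return [probs[letter] for letter in letters]


def select_or_abstain(answer: str, score: float, threshold: float) -> Optional[str]:
    """Selective generation: return the answer only if its score clears the threshold."""
    return answer if score >= threshold else None


if __name__ == "__main__":
    # Made-up log-probabilities standing in for a real model's next-token distribution.
    yes_no = {"Yes": math.log(0.6), "No": math.log(0.2), "Maybe": math.log(0.1)}
    score = pointwise_score(yes_no)  # 0.6 / (0.6 + 0.2), about 0.75
    print(select_or_abstain("Paris", score, threshold=0.5))  # Paris
    letters = {"A": math.log(0.1), "B": math.log(0.3), "C": math.log(0.1), "The": math.log(0.5)}
    print(multiway_scores(letters, num_candidates=3))  # approximately [0.2, 0.6, 0.2]
```

Restricting the softmax to the answer tokens discards probability mass on unrelated continuations (such as "Maybe" or "The" above), so the score reflects only the model's choice among the options it was asked to pick from.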

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Attention Dropout · Layer Normalization · Residual Connection · Weight Decay