Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown

Lifu Tu; Rui Meng; Shafiq Joty; Yingbo Zhou; Semih Yavuz

arXiv:2411.15993·cs.CL·September 26, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates the factual accuracy of long-form text generated by large language models, analyzing how self-assessment scores relate to factuality and proposing a mathematical model to quantify this relationship.

Contribution

It introduces a new evaluation framework using Self-Known and Self-Unknown scores to assess factuality in long-form generation and links these scores with a mathematical model.

Findings

01. Higher Self-Known scores correlate with better factuality.

02. Higher Self-Unknown scores are associated with lower factuality.

03. Unsupported claims can increase without changing self-judgment scores.

Abstract

Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation. However, they often lack factuality, producing a mixture of true and false information, especially in long-form generation. In this work, we investigate the factuality of long-form text generation across various LLMs, including GPT-4, Gemini-1.5-Pro, Claude-3-Opus, Llama-3-70B, and Mistral. Our analysis reveals that factuality tends to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims. Furthermore, we explore the effectiveness of different evaluation settings to assess whether LLMs can accurately judge the correctness of their own outputs: Self-Known (the percentage of supported atomic claims, decomposed from LLM outputs, that the corresponding LLMs judge as correct) and Self-Unknown (the percentage of…
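The Self-Known definition above can be made concrete with a small sketch. It assumes each atomic claim already carries two labels: whether an external checker found it supported, and whether the generating model judged it correct. The `Claim` type, the function names, and the example data are hypothetical. Because the abstract is truncated before it defines Self-Unknown, the `self_unknown` function uses an assumed mirror definition (the share of unsupported claims the model judges as incorrect). The `factuality` function likewise assumes a simple fraction of supported claims; it is not necessarily the paper's scoring procedure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    """One atomic claim decomposed from a model output (hypothetical type)."""
    supported: bool       # label from an external fact checker (assumed given)
    judged_correct: bool  # the generating model's own verdict on the claim


def self_known(claims: List[Claim]) -> Optional[float]:
    """Share of supported claims that the model itself judges as correct."""
    supported = [c for c in claims if c.supported]
    if not supported:
        return None  # undefined when there are no supported claims
    return sum(c.judged_correct for c in supported) / len(supported)


def self_unknown(claims: List[Claim]) -> Optional[float]:
    """ASSUMED definition (the abstract is truncated here): share of
    unsupported claims that the model itself judges as incorrect."""
    unsupported = [c for c in claims if not c.supported]
    if not unsupported:
        return None  # undefined when there are no unsupported claims
    return sum(not c.judged_correct for c in unsupported) / len(unsupported)


def factuality(claims: List[Claim]) -> Optional[float]:
    """ASSUMED simple score: fraction of all claims that are supported."""
    if not claims:
        return None
    return sum(c.supported for c in claims) / len(claims)


# Made-up example: three supported claims, two unsupported claims.
example_claims = [
    Claim(supported=True, judged_correct=True),
    Claim(supported=True, judged_correct=True),
    Claim(supported=True, judged_correct=False),
    Claim(supported=False, judged_correct=True),
    Claim(supported=False, judged_correct=False),
]

if __name__ == "__main__":
    print("Self-Known:", self_known(example_claims))
    print("Self-Unknown (assumed definition):", self_unknown(example_claims))
    print("Factuality (assumed definition):", factuality(example_claims))
```

In this made-up example, the model accepts two of the three supported claims and rejects one of the two unsupported claims. Both scores are conditional rates, which is why each function returns `None` when its denominator is empty.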

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

- The authors conducted extensive analysis supporting the intuition that factual errors tend to be more frequent in the latter part of an LLM's response. They show that this holds across different models and also in the RAG scenario.

Weaknesses

- The finding that models tend to make more factuality errors in the later part of the output is a widely accepted conclusion, so it is not surprising.
- The conclusion about the relationship between the factuality score and Self-Known and Self-Unknown is not established. First, there appears to be an error in the transition from Eq. (1) to Eq. (2), and the definitions of "y_correct" and "y_wrong" are hard to understand. Second, the assumption that there is no "over confidence" is not correc

Reviewer 02 · Rating 5 · Confidence 3

Strengths

The introduction of Self-Known and Self-Unknown scores provides a fresh perspective on evaluating LLMs' self-assessment capabilities, offering valuable insights into their strengths and limitations. The paper systematically analyzes multiple LLMs, covering different evaluation settings and utilizing both human and automated tools for factuality assessment. This extensive examination adds rigor to the study. By highlighting the consistent decline in factuality in long-form outputs, the research a

Weaknesses

1. The difference between Figure 1 (a) and Figure 1 (b) is not illustrated. It will be beneficial to add illustrations in caption.
2. The proposed method can perform evaluation to long-term generation by the model itself, but it is time-consuming to do evaluation for every atomic statement. The detection itself has a very strong trade-off between performance and inference costs.
3. There is not a way to adjust or convert the fact errors into correct facts. The proposed method only covers fact er

Reviewer 03 · Rating 3 · Confidence 3

Strengths

a. The problem explored in the paper is interesting, and worth further exploration.
b. The authors estimate a factuality score based on Self-Known and Self-Unknown, which makes sense

Weaknesses

a. The presentation of the paper is confusing, and its line of argument is very unclear. For example, the paper proposes three ways to evaluate Self-Known and Self-Unknown, but no details are provided for any of the methods. In Section 4, an estimate of the factuality score is proposed, while Section 5 states that the factuality score is measured using FActScore; which score is used in the subsequent experiments?
b. As an analytical paper, it lacks insigh

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques

Methods: Dense Connections · Label Smoothing · Dropout · Linear Layer · Layer Normalization · Byte Pair Encoding · Adam · Residual Connection · Softmax · Attention Is All You Need