TL;DR
This paper introduces the Factual Robustness Score (FRS), a metric that measures how stable the factual knowledge in a large language model remains under perturbations in decoding conditions. It evaluates robustness from the perspective of the generation process rather than the prompt.
Contribution
The paper proposes a novel, principled approach to measure factual robustness based on token entropy and temperature sensitivity, addressing limitations of existing performance-based metrics.
Findings
Factual robustness varies significantly across models.
Larger models achieve higher FRS, indicating more robust factual knowledge.
Accuracy drops by about 60% with increased uncertainty.
Abstract
Ensuring the robustness of factual knowledge in LLMs is critical for reliable applications in tasks such as question answering and reasoning. However, existing evaluation methods predominantly focus on performance-based metrics, often investigating from the perspective of prompt perturbations, which captures only the externally triggered side of knowledge robustness. To bridge this gap, we introduce a principled approach to measure factual robustness from the perspective of the generation process by analyzing token distribution entropy in combination with temperature scaling sensitivity. These two factors build the Factual Robustness Score (FRS), a novel metric which quantifies the stability of a fact against perturbations in decoding conditions, given its initial uncertainty. To validate our approach, we conduct extensive experiments on 5 LLMs across 3 closed-book QA datasets (SQuAD,…
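The two ingredients named in the abstract, token distribution entropy and sensitivity to temperature scaling, can be illustrated with a small sketch. This is not the paper's FRS formula, which the text above does not give. The logits are hypothetical values for four candidate answer tokens, and the helper names `softmax` and `entropy` are chosen here for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy of a probability distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical logits (an assumption, not data from the paper);
# index 0 plays the role of the factually correct token.
logits = [4.0, 2.0, 1.0, 0.5]

for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, t)
    print(f"T={t}: p(correct)={probs[0]:.3f}, entropy={entropy(probs):.3f}")
```

Raising the temperature flattens the distribution: entropy grows and the probability mass on the top token shrinks, so sampling is more likely to return a wrong answer. A fact whose correct token keeps a large share of the mass as the temperature rises would count as more robust in this informal sense.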