TL;DR
This paper examines how the popularity of knowledge influences large language models' ability to recognize their knowledge boundaries, proposing methods to improve confidence calibration and answer correctness prediction.
Contribution
It introduces a novel analysis of knowledge popularity's impact on LLMs and proposes leveraging popularity signals for better confidence calibration and boundary perception.
Findings
LLMs answer more accurately, and with higher confidence, on more popular knowledge.
Relation popularity has the strongest correlation with LLMs' performance.
Using popularity signals improves answer correctness prediction accuracy by an average of 5.24% across all evaluated models and datasets.
Abstract
Large language models (LLMs) often fail to recognize their knowledge boundaries, producing confident yet incorrect answers. In this paper, we investigate how knowledge popularity affects LLMs' ability to perceive their knowledge boundaries. Focusing on entity-centric factual question answering (QA), we quantify knowledge popularity from three perspectives: the popularity of entities in the question, the popularity of entities in the answer, and relation popularity, defined as their co-occurrence frequency. Experiments on three representative datasets containing knowledge with varying popularity show that LLMs exhibit better QA performance, higher confidence, and more accurate perception on more popular knowledge, with relation popularity having the strongest correlation. Because knowledge popularity shows strong correlation with LLMs' QA performance, we propose to leverage these signals…
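The three popularity signals named in the abstract can be illustrated with a minimal sketch. It assumes popularity is approximated by document frequency over a text corpus and that entities are matched by case-insensitive substring search; the function name `popularity_signals` and the toy corpus are hypothetical, and the paper's actual corpus and counting procedure are not described in this summary.

```python
def popularity_signals(corpus, question_entity, answer_entity):
    """Count documents mentioning each entity and documents mentioning both.

    Assumption: document frequency with naive substring matching stands in
    for whatever counting method the paper actually uses.
    """
    q = question_entity.lower()
    a = answer_entity.lower()
    signals = {"question_entity": 0, "answer_entity": 0, "relation": 0}
    for doc in corpus:
        text = doc.lower()
        has_q = q in text
        has_a = a in text
        signals["question_entity"] += int(has_q)
        signals["answer_entity"] += int(has_a)
        # Relation popularity: both entities co-occur in the same document.
        signals["relation"] += int(has_q and has_a)
    return signals


# Toy corpus, invented purely for illustration.
corpus = [
    "Marie Curie was born in Warsaw.",
    "Warsaw is the capital of Poland.",
    "Marie Curie won two Nobel Prizes.",
    "Born in Warsaw, Marie Curie later moved to Paris.",
]
signals = popularity_signals(corpus, "Marie Curie", "Warsaw")
print(signals)
```

In this sketch, relation popularity can never exceed either entity count, since a document only contributes to it when both entities appear.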
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The experimental setup is clear and easy to understand.
2. The analysis of correlations between knowledge popularity and model confidence is novel. There are some interesting insights.
3. There is potential in using popularity as a feature for confidence calibration.

1. The baselines for confidence calibration are pretty weak. For example, Kadavath et al., 2022 seems like a good candidate on top of PC? Also, since you are training the MLPs, I think you should consider implementing some baselines from the paragraph “Confidence Estimation via LLM Internal States” in the related work.
2. The presentation of the analysis can be a bit hard to follow. It is unclear what the most important takeaways are. Maybe it would be good to summarize the important findings in the conclusion.
Overall, while the findings of this work are quite predictable (we have known them from Mallen et al. and other related works), I tend to like this work as it extends the prior datasets along domains, which will be super-useful for the field. Despite my positivity, one may argue that this paper should be published in a resource-centric venue, which may be a fair argument.
Like I said in the previous response, I think the main weakness of this work is that its overall findings are quite predictable (we have known them from Mallen et al. and other related works). So the main sales pitch is additional resources. Is that enough to warrant an ICLR paper? Unsure, but I am leaning toward saying that it's not enough.
- The idea to explore the alignment of model confidence and its actual performance is intriguing.
- The paper offers an interesting finding regarding model hallucination: the incorrect answers often have better popularity than ground truth answers.
- Popularity-based calibration is simple and effective in improving model accuracy.
- The paper fails to acknowledge important prior work by Kandpal et al. [1], which first studied the relationship between co-occurrence frequency and a model’s factual accuracy, and demonstrated both correlation and causation between the two.
- The analysis is mostly correlational. No causal interventions are performed to show that popularity is indeed what causes the variation in accuracy and confidence.
- The MLP's simple structure likely makes it prone to performing poorly on imbalanced training
- The study methodically measures knowledge popularity through three distinct lenses: the entity within the question, the entity within the answer, and the co-occurrence (relation) of both. This enables an examination of how popularity influences the performance and confidence of LLMs.
- Leveraging these popularity metrics as signals, the research enhances answer correctness prediction accuracy by an average of 5.24% across all evaluated models and datasets.
- It also proposes a feasible techniq
- A significant practical limitation of the proposed calibration approach is its dependence on the Probabilistic Confidence (PC) metric. While PC was not introduced in this paper, the entire analysis and the resulting calibration technique are fundamentally restricted to models that allow users access to token probabilities. This dependency is a major obstacle for real-world scenarios involving black-box API-based LLMs. The paper should explicitly mention this limitation.
- Another critical point is
