Are LLMs Really Not Knowledgeable? Mining the Submerged Knowledge in LLMs' Memory
Xingjian Tao, Yiwei Wang, Yujun Cai, Zhicheng Yang, Jing Tang

TL;DR
This paper reveals that large language models often retain correct knowledge even when giving incorrect answers, and introduces new metrics and insights to better evaluate and utilize their latent knowledge in question answering tasks.
Contribution
The study uncovers hidden correct knowledge in LLMs despite incorrect outputs, and proposes Hits@k as a new metric to evaluate this latent knowledge independently of answer surface form.
Findings
LLMs often have more factual knowledge than standard QA accuracy suggests.
Prompting strategies allowing 'unsure' outputs can suppress correct answers.
Hits@k effectively measures latent knowledge retention in LLMs.
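The second finding can be pictured with a small sketch. This is not the paper's decoding algorithm. It assumes the model's first-answer distribution is already available as a plain dictionary from candidate string to probability, and the function name `best_answer_excluding` and the default excluded label `"unsure"` are hypothetical choices made for illustration only.

```python
from typing import AbstractSet, Dict, Optional, Tuple


def best_answer_excluding(
    probs: Dict[str, float],
    excluded: AbstractSet[str] = frozenset({"unsure"}),
) -> Optional[Tuple[str, float]]:
    """Return the most probable candidate once abstention labels are dropped.

    `probs` maps candidate answers to probabilities (assumed input format).
    The returned probability is renormalized over the remaining candidates.
    """
    # Drop abstention labels such as "unsure" (case-insensitive match).
    kept = {c: p for c, p in probs.items() if c.strip().lower() not in excluded}
    total = sum(kept.values())
    if not kept or total <= 0:
        return None  # nothing left to choose from
    best = max(kept, key=kept.get)
    return best, kept[best] / total


# Toy distribution: "unsure" wins under greedy selection,
# but a concrete answer sits right behind it.
toy = {"unsure": 0.5, "Canberra": 0.3, "Sydney": 0.2}
print(best_answer_excluding(toy))
```

In this toy case a greedy decoder would output "unsure" even though the distribution places most of its remaining mass on a single concrete candidate.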
Abstract
Large language models (LLMs) have shown promise as parametric knowledge bases, but often underperform on question answering (QA) tasks due to hallucinations and uncertainty. While prior work attributes these failures to knowledge gaps in the model's parameters, we uncover a complementary phenomenon: LLMs frequently retain correct knowledge even when generating incorrect or "unsure" answers. By analyzing the token-level output distributions, we find that correct answers often appear among high-probability candidates, despite not being selected. Motivated by this, we propose Hits@k, a novel metric to evaluate latent knowledge retention independent of answer surface form. Our experiments reveal that LLMs possess significantly more factual knowledge than is reflected by standard QA accuracy. Building on these insights, we further examine the prevailing few-shot QA paradigm. We find that…
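As a rough illustration of the metric named in the abstract, the sketch below computes a Hits@k score under the usual reading of the term: the fraction of questions whose gold answer appears among the model's k highest-ranked candidates. The exact definition used in the paper is not given on this page, so the ranked-candidate input format, the `normalize` helper (a crude stand-in for matching "independent of answer surface form"), and the function name `hits_at_k` are all assumptions.

```python
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed, simplistic matching rule)."""
    return " ".join(text.lower().split())


def hits_at_k(
    ranked_candidates: Sequence[Sequence[str]],
    gold_answers: Sequence[str],
    k: int,
) -> float:
    """Fraction of questions whose gold answer is among the top-k candidates.

    ranked_candidates[i] lists the model's candidates for question i,
    sorted from most to least probable.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(ranked_candidates) != len(gold_answers):
        raise ValueError("need one candidate list per gold answer")
    if not gold_answers:
        return 0.0
    hits = 0
    for candidates, gold in zip(ranked_candidates, gold_answers):
        top_k = {normalize(c) for c in candidates[:k]}
        if normalize(gold) in top_k:
            hits += 1
    return hits / len(gold_answers)


# Toy data: the top-1 candidate is wrong for every question,
# but two of the three gold answers are ranked second.
ranked = [
    ["Lyon", "Paris", "Marseille"],
    ["unsure", "Canberra", "Sydney"],
    ["Rome", "Milan", "Turin"],
]
gold = ["Paris", "canberra", "Madrid"]
print(hits_at_k(ranked, gold, k=1))  # 0.0
print(hits_at_k(ranked, gold, k=2))  # 2 of 3 questions hit
```

With k = 1 the score coincides with ordinary top-1 accuracy, so the gap between Hits@1 and Hits@k for larger k is one simple way to quantify knowledge that is ranked highly but not selected.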
Peer Reviews
Decision: ICLR 2026 Poster
- The paper introduces a clear and intuitive metric that captures latent knowledge beyond standard accuracy, offering a new perspective on model evaluation.
- The analysis reveals that models often “know” more than they express, which challenges common assumptions about what low-confidence or incorrect outputs imply.
- The experiments are extensive and well controlled, which show consistent trends across multiple model scales and factual datasets.
- The study provides actionable insights for

- The paper does not provide a formal justification for why Hits@k should reflect internal knowledge rather than distributional coincidence, relying mainly on empirical correlations (Figure 3).
- The improvement margins between Hits@k and standard accuracy are sometimes modest -- for example, less than 5% in several datasets (Table 2) -- which weakens the claim of large hidden knowledge reserves.
- The evaluation focuses narrowly on factual recall and omits reasoning or multi-hop questions, so

- This paper catches an interesting finding that LLMs often maintain access to accurate information within their probability distributions over vocabulary tokens, and there is a systematic gap between knowledge storage and expression rather than simple knowledge absence.
- It offers new insights into knowledge augmentation: instead of expanding knowledge, augmenting the ability to express existing knowledge is important and can be potentially very useful.

- Though it is an interesting finding, I still believe that LLMs are not knowledgeable even though they assign significant probability scores to tokens representing the correct information, since in real-world use cases, it is impractical to let LLMs generate multiple responses to each query. Therefore, I don't think Hits@k should be used for evaluation/rank models.
- The proposed decoding algorithm can raise many safety or ethical concerns if deployed into general use cases, since in many real

- Clear motivation. The paper highlights an intuitively important gap between latent knowledge and surface-level generation in LLMs.
- Empirical observations are easy to interpret. Hits@k and “unsure” filtering provide simple and intuitive diagnostic signals.
- Readable paper structure. The writing is clear, and the experiments are straightforward to follow.

- The insight is not novel, as similar conclusions have long existed in perplexity-based evaluations, which already reflect that LLMs may assign high probability to correct tokens that are not selected in top-1 decoding.
- The phenomenon is also well-known from rollout-based methods (e.g., multi-sampling, self-consistency, and RL trajectories), which routinely reveal correct answers in non-greedy decoding paths.
- The paper lacks deeper analysis or actionable contribution, offering no explanat
Taxonomy
Topics: Artificial Intelligence in Law
