Physics of Language Models: Part 3.2, Knowledge Manipulation
Zeyuan Allen-Zhu, Yuanzhi Li

TL;DR
This paper examines the ability of language models to manipulate stored knowledge across tasks like retrieval, classification, comparison, and inverse search, revealing significant limitations even in large models like GPT-4.
Contribution
It provides a controlled experiment demonstrating that language models inherently struggle with knowledge manipulation tasks, highlighting fundamental weaknesses despite extensive training.
Findings
Models excel at knowledge retrieval but struggle with even the simplest classification and comparison tasks unless Chain of Thought (CoT) is used during both training and inference.
Accuracy on inverse search is virtually 0%, regardless of the prompt.
Weaknesses are inherent, not due to model size or training data.
Abstract
Language models can store vast factual knowledge, yet their ability to flexibly use this knowledge for downstream tasks (e.g., via instruction finetuning) remains questionable. This paper investigates four fundamental knowledge manipulation tasks: retrieval (e.g., "What is person A's attribute X?"), classification (e.g., "Is A's attribute X even or odd?"), comparison (e.g., "Is A greater than B in attribute X?"), and inverse search (e.g., "Which person's attribute X equals T?"). We show that language models excel in knowledge retrieval but struggle even in the simplest classification or comparison tasks unless Chain of Thoughts (CoTs) are employed during both training and inference. Moreover, their performance in inverse knowledge search is virtually 0%, regardless of the prompts. Our primary contribution is a controlled, synthetic experiment that confirms these weaknesses are…
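The four task types named in the abstract can be made concrete with a small sketch. The records, names, and question templates below are hypothetical illustrations and not the paper's dataset or generator. The CoT answer format (state the retrieved value first, then the label) is likewise an assumption about what such a format might look like.

```python
# Hypothetical synthetic records; names and the birth_year attribute are
# illustrative assumptions, not taken from the paper's data.
PEOPLE = {
    "Alice Moreno": {"birth_year": 1987},
    "Bao Tran": {"birth_year": 1990},
    "Carla Weiss": {"birth_year": 1975},
}

def retrieval(person, attr):
    """Retrieval: 'What is person A's attribute X?'"""
    question = f"What is {person}'s {attr}?"
    return question, str(PEOPLE[person][attr])

def classification(person, attr, cot=False):
    """Classification: 'Is A's attribute X even or odd?'

    With cot=True the answer spells out the retrieved value before the
    label (an assumed CoT format, for illustration only).
    """
    value = PEOPLE[person][attr]
    label = "even" if value % 2 == 0 else "odd"
    question = f"Is {person}'s {attr} even or odd?"
    answer = f"{value}, so {label}" if cot else label
    return question, answer

def comparison(person_a, person_b, attr):
    """Comparison: 'Is A greater than B in attribute X?'"""
    question = f"Is {person_a} greater than {person_b} in {attr}?"
    greater = PEOPLE[person_a][attr] > PEOPLE[person_b][attr]
    return question, "yes" if greater else "no"

def inverse_search(attr, target):
    """Inverse search: 'Which person's attribute X equals T?'"""
    question = f"Which person's {attr} equals {target}?"
    matches = sorted(p for p, rec in PEOPLE.items() if rec[attr] == target)
    return question, ", ".join(matches)

if __name__ == "__main__":
    print(retrieval("Alice Moreno", "birth_year"))
    print(classification("Bao Tran", "birth_year", cot=True))
    print(comparison("Alice Moreno", "Carla Weiss", "birth_year"))
    print(inverse_search("birth_year", 1975))
```

Note that all four answers are computed from the same stored record. The tasks differ only in what must be done with the value after it is looked up, which is the distinction the abstract draws between retrieval and manipulation.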
Peer Reviews
Decision: ICLR 2025 Poster
1. The experiments are well-controlled, focusing on minimizing extraneous variables, which allows the study to provide numerous insights (Results 1-8). Unlike most research on LLM knowledge, which often lacks such control and tends to use pre-trained LLMs, this study stands out as an important contribution.
1. The paper includes four knowledge manipulation tasks, but the reason for putting all four together in this single paper is not entirely clear. While the performance of each individual task is shown and discussed, it would be helpful if the authors discussed the overall implications of these results, i.e., what can be concluded when combining these findings.
2. While performance for each manipulation task is discussed, there is no analysis on why the results are as they are. For instance, why
1. The flaws of LLMs highlighted in this work are fundamental and significant. The testbed and methodology presented here could serve as a valuable benchmark for future generations of LLMs, potentially marking a clear boundary between AI systems that possess system II reasoning capabilities and those that do not.
2. The evaluation is reasonably thorough, including various types of knowledge manipulation tasks and examining both open and closed-source LLMs.
1. It's not a new finding that LLMs struggle to retrieve and apply stored knowledge effectively for solving reasoning tasks [1, 2]. However, the authors do not cite this existing work.
2. The authors report a new finding that LLMs trained with CoT are not better at knowledge manipulation. However, examining the training samples (L360-362) reveals that the CoT format used is incomplete: it includes only intermediate answers without the full reasoning chain, which may have weakened its effect.
- The paper is thorough and self-contained, re-explaining key concepts, which enhances readability and makes the study easy to follow.
- The writing is clear, and methodological points are frequently illustrated with examples, adding clarity to the analysis.
- The use of synthetic datasets to examine LLMs within a controlled environment is a valuable and relevant methodological choice.
- By testing multiple models, including recent versions like GPT-4o, the paper strengthens the generalizabilit
Figures 3, 4, and 6 are challenging to interpret, particularly due to the color coding and the distinctions between various lines (e.g., bioS multi5 + permute vs. bioS multi5 + permute + fullname) and the variations in behavior among them. Reducing the number of results displayed and focusing on a deeper analysis of selected findings could enhance clarity.
Taxonomy
Topics: Topic Modeling · Machine Learning and Algorithms · Explainable Artificial Intelligence (XAI)
Methods: Attention Is All You Need · Residual Connection · Adam · Dropout · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Softmax · Position-Wise Feed-Forward Layer
