Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact
Michael Hardy, Yunsung Kim

TL;DR
This paper investigates the gap between large language models' benchmark performance and their real-world educational impact, revealing shared biases and the difficulty of aligning models with intended outcomes.
Contribution
It introduces methods to measure LLM alignment with complex, high-noise tasks and uncovers that common pretraining largely explains observed misalignments.
Findings
LLMs' behaviors correlate more strongly with one another than with those of human experts.
Shared biases in LLMs often negatively impact intended learning outcomes.
Ensemble methods can worsen misalignment with educational goals.
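The comparison behind these findings can be sketched as follows. This is a minimal illustration rather than the authors' method: the rating data, the model names, the benchmark weights, and the choice of Pearson correlation are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_inter_model_corr(model_scores):
    """Average correlation over all pairs of models."""
    pairs = list(combinations(model_scores.values(), 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def mean_model_expert_corr(model_scores, expert_scores):
    """Average correlation of each model with the expert ratings."""
    corrs = [pearson(s, expert_scores) for s in model_scores.values()]
    return sum(corrs) / len(corrs)


def weighted_ensemble(model_scores, weights):
    """Per-item weighted average of model ratings (weights are hypothetical)."""
    total = sum(weights[m] for m in model_scores)
    n_items = len(next(iter(model_scores.values())))
    return [
        sum(weights[m] * model_scores[m][i] for m in model_scores) / total
        for i in range(n_items)
    ]


# Made-up ratings of five teaching samples (illustrative only, not paper data).
model_scores = {
    "model_a": [1, 2, 3, 4, 5],
    "model_b": [2, 2, 3, 5, 5],
    "model_c": [1, 3, 3, 4, 4],
}
expert_scores = [3, 1, 4, 2, 5]
# Hypothetical benchmark accuracies used as ensemble weights.
benchmark_weights = {"model_a": 0.9, "model_b": 0.8, "model_c": 0.7}

inter = mean_inter_model_corr(model_scores)
expert = mean_model_expert_corr(model_scores, expert_scores)
ensemble = weighted_ensemble(model_scores, benchmark_weights)
ensemble_vs_expert = pearson(ensemble, expert_scores)
```

In this toy data the models agree with each other more than with the expert. Because the ensemble averages ratings that share the same bias, weighting by benchmark score cannot by itself bring the combined rating closer to the expert.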
Abstract
LLMs increasingly excel on AI benchmarks, but doing so does not guarantee validity for downstream tasks. This study contrasts LLM alignment on benchmarks, downstream tasks, and, importantly, the intended impact of those tasks. We evaluate the performance of leading LLMs (i.e., generative pre-trained base models) on difficult-to-verify tasks of the teaching and learning of schoolchildren. Across all LLMs, inter-model behaviors on disparate tasks correlate more strongly than they do with expert human behaviors on target tasks. These biases shared across LLMs are poorly aligned with downstream measures of teaching quality and often negatively aligned with the intended impact of student learning outcomes. Further, we find that multi-model ensembles, both unanimous model voting and expert-weighting by benchmark performance, exacerbate misalignment with learning. We measure that selection of LLM…
