Investigating Concept Alignment Using Implausible Category Members
Sunayana Rane, Brenden M. Lake, Thomas L. Griffiths

TL;DR
This paper probes how AI models represent everyday concepts by asking them about implausible category members (e.g., "Is an olive a vehicle?"), revealing departures from human categorization that bear on AI safety.
Contribution
It introduces a novel method of probing AI concept boundaries using implausible examples and compares model responses to human judgments.
Findings
Models differ from humans in which words they accept as members of the categories "vehicle" and "clothing".
Models often misclassify vegetables as fruits.
Misalignments in concept understanding can lead to safety issues in AI behavior.
Abstract
Developing AI systems with a human-like understanding of everyday concepts is a key step towards developing safe, reliable systems whose behavior makes sense to humans. When probing concept understanding, asking questions about plausible category members (e.g., "Is a car a vehicle?") is likely to recall patterns in the model's vast training data. We pursue an alternative strategy, characterizing the boundaries of conceptual categories by asking about implausible category members (e.g., "Is an olive a vehicle?") to probe the kind of concept-level knowledge we take for granted in fellow humans. We characterize concept boundaries for a set of fundamental concepts by studying AI systems' assignments of objects to superordinate categories from a classic psychological study by Rosch and Mervis, as well as their assignments of the same objects to mismatched superordinate categories. We compare…
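The probing strategy described in the abstract can be sketched as a cross of every item with every superordinate category, so that each item is asked about once in its own category (plausible) and once in each mismatched category (implausible). This is only a minimal illustration under stated assumptions: the three-item stimulus set, the assignment of "olive" to "fruit", the helper names `article` and `build_probes`, and the exact question template are hypothetical and are not taken from the paper's materials.

```python
# Hypothetical toy stimuli for illustration only; NOT the paper's item list.
ITEMS_BY_CATEGORY = {
    "vehicle": ["car"],
    "clothing": ["shirt"],
    "fruit": ["olive"],  # assumed home category for this sketch
}


def article(word):
    """Pick 'a' or 'an' with a simple first-letter heuristic."""
    return "an" if word[0].lower() in "aeiou" else "a"


def build_probes(items_by_category):
    """Cross every item with every category and label each pairing.

    A probe is 'plausible' when the target category is the item's own
    category, and implausible (mismatched) otherwise.
    """
    probes = []
    for home_category, items in items_by_category.items():
        for item in items:
            for target_category in items_by_category:
                question = (
                    f"Is {article(item)} {item} "
                    f"{article(target_category)} {target_category}?"
                )
                probes.append(
                    {
                        "item": item,
                        "category": target_category,
                        "plausible": home_category == target_category,
                        "question": question,
                    }
                )
    return probes


if __name__ == "__main__":
    for probe in build_probes(ITEMS_BY_CATEGORY):
        label = "plausible" if probe["plausible"] else "implausible"
        print(f"[{label}] {probe['question']}")
```

Each question would then be posed to a model and, separately, to human participants, and the yes/no answers compared per item and category. How answers are elicited and scored is outside this sketch.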
