THiNK: Can Large Language Models Think-aloud?
Yongan Yu, Mengqian Wu, Yiran Lin, Nikki G. Lobczowski

TL;DR
THiNK is a multi-agent, feedback-driven framework based on Bloom's Taxonomy that systematically evaluates and enhances the reasoning skills of large language models through iterative reflection and refinement.
Contribution
We introduce THiNK, a novel evaluation framework that assesses and improves both lower- and higher-order thinking skills in LLMs using iterative problem generation, critique, and revision.
Findings
Models excel at lower-order thinking skills.
Models struggle with applying knowledge in realistic contexts.
Structured feedback improves higher-order reasoning.
Abstract
Assessing higher-order thinking skills in large language models (LLMs) remains a fundamental challenge, especially in tasks that go beyond surface-level accuracy. In this work, we propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom's Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think aloud through step-by-step reflection and refinement. This enables a systematic evaluation of both lower-order (e.g., remember, understand) and higher-order (e.g., evaluate, create) thinking skills. We apply THiNK to seven state-of-the-art LLMs and perform a detailed cognitive analysis of their outputs. Results reveal that while models perform reliably well in lower-order categories, they struggle with applying knowledge in realistic contexts and…
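The generate, critique, and revise cycle described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`refine`, `make_toy_critic`, `toy_revise`), the 0.8 threshold, the round limit, and the tag-based toy scoring are all assumptions made for illustration. In a real setup each critic and the reviser would be LLM calls; here they are deterministic stand-ins so the loop can be run as is. The six level names are those of the revised Bloom's Taxonomy.

```python
from typing import Callable, Dict, List, Tuple

# The six cognitive levels of the revised Bloom's Taxonomy.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

# Hypothetical interface: a critic maps a problem to (score in [0, 1], feedback).
Critic = Callable[[str], Tuple[float, str]]
Reviser = Callable[[str, List[str]], str]


def refine(problem: str,
           critics: Dict[str, Critic],
           revise: Reviser,
           threshold: float = 0.8,   # assumed pass mark, not from the paper
           max_rounds: int = 3):     # assumed round limit, not from the paper
    """Critique a problem at every level, revise on feedback, repeat."""
    history: List[Dict[str, float]] = []
    for _ in range(max_rounds):
        results = {level: critics[level](problem) for level in BLOOM_LEVELS}
        history.append({level: score for level, (score, _) in results.items()})
        # Collect feedback only from the levels that scored below the threshold.
        feedback = [f"{level}: {msg}"
                    for level, (score, msg) in results.items()
                    if score < threshold]
        if not feedback:
            break  # every level passed, so stop revising
        problem = revise(problem, feedback)
    return problem, history


def make_toy_critic(level: str) -> Critic:
    """Toy stand-in for an LLM critic: passes if the problem carries a level tag."""
    def critic(problem: str) -> Tuple[float, str]:
        if f"[{level}]" in problem:
            return 1.0, "ok"
        return 0.0, f"add a component targeting '{level}'"
    return critic


def toy_revise(problem: str, feedback: List[str]) -> str:
    """Toy stand-in for an LLM reviser: appends a tag for each criticized level."""
    levels = [item.split(":")[0] for item in feedback]
    return problem + "".join(f" [{level}]" for level in levels)


critics = {level: make_toy_critic(level) for level in BLOOM_LEVELS}
final_problem, history = refine("Define photosynthesis. [remember]", critics, toy_revise)
for round_index, scores in enumerate(history, start=1):
    print(round_index, scores)
```

Recording per-level scores for each round makes it possible to see which levels improve after feedback, which is the kind of comparison the findings above describe.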
Taxonomy
Topics: Natural Language Processing Techniques
Methods: ALIGN
