Evaluating LLMs Across Multi-Cognitive Levels: From Medical Knowledge Mastery to Scenario-Based Problem Solving
Yuxuan Zhou, Xien Liu, Chenwei Yan, Chen Ning, Xiao Zhang, Boxun Li, Xiangling Fu, Shijin Wang, Guoping Hu, Yu Wang, Ji Wu

TL;DR
This study introduces a multi-cognitive-level evaluation framework based on Bloom's Taxonomy to assess LLMs' medical knowledge and problem-solving abilities, revealing performance drops at higher cognitive levels.
Contribution
The paper proposes a multi-cognitive-level evaluation framework for medical LLMs and systematically assesses models from six families across three cognitive levels, showing that performance drops as cognitive complexity rises.
Findings
Performance declines as cognitive complexity increases.
Model size matters more for performance at higher cognitive levels than at lower ones.
The results offer insights for improving LLMs in real-world medical tasks.
Abstract
Large language models (LLMs) have demonstrated remarkable performance on various medical benchmarks, but their capabilities across different cognitive levels remain underexplored. Inspired by Bloom's Taxonomy, we propose a multi-cognitive-level evaluation framework for assessing LLMs in the medical domain in this study. The framework integrates existing medical datasets and introduces tasks targeting three cognitive levels: preliminary knowledge grasp, comprehensive knowledge application, and scenario-based problem solving. Using this framework, we systematically evaluate state-of-the-art general and medical LLMs from six prominent families: Llama, Qwen, Gemma, Phi, GPT, and DeepSeek. Our findings reveal a significant performance decline as cognitive complexity increases across evaluated models, with model size playing a more critical role in performance at higher cognitive levels. Our…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)
Methods: Cosine Annealing · Layer Normalization · Linear Warmup With Cosine Annealing · Attention Dropout · Discriminative Fine-Tuning · Byte Pair Encoding · Softmax · Linear Layer · Dropout
