Evaluating LLMs Across Multi-Cognitive Levels: From Medical Knowledge Mastery to Scenario-Based Problem Solving

Yuxuan Zhou; Xien Liu; Chenwei Yan; Chen Ning; Xiao Zhang; Boxun Li; Xiangling Fu; Shijin Wang; Guoping Hu; Yu Wang; Ji Wu

arXiv:2506.08349·cs.CL·June 11, 2025

Evaluating LLMs Across Multi-Cognitive Levels: From Medical Knowledge Mastery to Scenario-Based Problem Solving

Yuxuan Zhou, Xien Liu, Chenwei Yan, Chen Ning, Xiao Zhang, Boxun Li, Xiangling Fu, Shijin Wang, Guoping Hu, Yu Wang, Ji Wu

PDF

Open Access 1 Repo

TL;DR

This study introduces a multi-cognitive-level evaluation framework based on Bloom's Taxonomy to assess LLMs' medical knowledge and problem-solving abilities, revealing performance drops at higher cognitive levels.

Contribution

It proposes a novel multi-cognitive-level evaluation framework for medical LLMs and systematically assesses multiple models across these levels, highlighting performance challenges.

Findings

01

Performance declines as cognitive complexity increases.

02

Model size impacts higher-level cognitive performance more.

03

Insights for improving LLMs in real-world medical tasks.

Abstract

Large language models (LLMs) have demonstrated remarkable performance on various medical benchmarks, but their capabilities across different cognitive levels remain underexplored. Inspired by Bloom's Taxonomy, we propose a multi-cognitive-level evaluation framework for assessing LLMs in the medical domain in this study. The framework integrates existing medical datasets and introduces tasks targeting three cognitive levels: preliminary knowledge grasp, comprehensive knowledge application, and scenario-based problem solving. Using this framework, we systematically evaluate state-of-the-art general and medical LLMs from six prominent families: Llama, Qwen, Gemma, Phi, GPT, and DeepSeek. Our findings reveal a significant performance decline as cognitive complexity increases across evaluated models, with model size playing a more critical role in performance at higher cognitive levels. Our…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

thumlp/multicogeval
pytorchOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsArtificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)

MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Cosine Annealing · Layer Normalization · Linear Warmup With Cosine Annealing · Attention Dropout · Discriminative Fine-Tuning · Byte Pair Encoding · Softmax · Linear Layer · Dropout