Performance of Large Language Models on Cognitive Aptitude Testing: A Multi-Run Evaluation on the German Medical School Admission Test (TMS)
Henrik Stelling, Armin Kraus, Gerrit Grieb, Ibrahim Güler

TL;DR
This study evaluates how well large language models perform on a German medical school aptitude test that measures reasoning and abstraction, finding that their performance is limited and inconsistent.
Contribution
The study introduces a multi-run evaluation framework for assessing LLMs on cognitive aptitude tests, revealing domain-specific limitations.
Findings
Mean accuracy on TMS items was substantially lower than the levels typically reported for knowledge-based medical exams.
Open-source models performed comparably to proprietary models on TMS tasks.
Inter-run reliability varied, showing inconsistent performance across repeated evaluations.
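The two quantities behind these findings, mean accuracy across runs and consistency between runs, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the answer key, the three runs, and the function names `run_accuracy` and `pairwise_agreement` are hypothetical, and mean pairwise agreement is only one simple stand-in for inter-run reliability; the excerpt does not state which reliability statistic the authors used.

```python
from itertools import combinations
from statistics import mean


def run_accuracy(answers, key):
    """Fraction of items answered correctly in a single run."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def pairwise_agreement(runs):
    """Mean share of items on which two runs give the same answer,
    averaged over all pairs of runs (a simple consistency proxy)."""
    return mean(
        sum(a == b for a, b in zip(r1, r2)) / len(r1)
        for r1, r2 in combinations(runs, 2)
    )


# Hypothetical data: a 4-item answer key and three repeated runs of one model.
key = ["A", "C", "B", "D"]
runs = [
    ["A", "C", "B", "A"],
    ["A", "B", "B", "D"],
    ["A", "C", "D", "A"],
]

accuracies = [run_accuracy(r, key) for r in runs]
mean_accuracy = mean(accuracies)
agreement = pairwise_agreement(runs)

print(f"per-run accuracy: {accuracies}")
print(f"mean accuracy:    {mean_accuracy:.3f}")
print(f"mean agreement:   {agreement:.3f}")
```

In this toy data, two runs reach the same accuracy while agreeing on only half of the items, which is why a multi-run design reports consistency separately from accuracy.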
Abstract
Background and Objectives: Large language models (LLMs) have demonstrated high performance on knowledge-based medical examinations, but their capabilities on cognitive aptitude tests emphasizing reasoning and abstraction remain underexplored. The Test for Medical Studies (TMS), a German medical school admission test, provides a standardized framework to examine these capabilities. This study aimed to evaluate the performance and consistency of multiple LLMs on text-based and visual-analytic TMS items. Materials and Methods: Eight contemporary LLMs, comprising proprietary and open-source systems, were evaluated using a multi-run design on standardized TMS items spanning text-based and visual-analytic cognitive domains. Results: Mean accuracy remained substantially below levels typically reported for knowledge-based medical examinations, with marked performance differences between…
Taxonomy
Topics: Medical Education and Admissions · Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills
