Performance of Large Language Models on Cognitive Aptitude Testing: A Multi-Run Evaluation on the German Medical School Admission Test (TMS)

Henrik Stelling; Armin Kraus; Gerrit Grieb; Ibrahim Güler

PMC · DOI: 10.3390/ejihpe16020023 · February 12, 2026

Open Access

TL;DR

This study evaluates how well large language models perform on a German medical school aptitude test that measures reasoning and abstraction, finding that their performance is limited and inconsistent.

Contribution

The study introduces a multi-run evaluation framework for assessing LLMs on cognitive aptitude tests, revealing domain-specific limitations.

Findings

01

Mean accuracy on TMS items was substantially lower than the levels typically reported for knowledge-based medical exams.

02

Open-source models performed comparably to proprietary models on TMS tasks.

03

Inter-run reliability varied, showing inconsistent performance across repeated evaluations.
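
To make the multi-run idea concrete, here is a minimal sketch of how per-run accuracy and run-to-run consistency could be computed for one model. It is an illustration under stated assumptions, not the authors' procedure: the answer key, the three runs, and the function names are hypothetical, and mean pairwise agreement is only one possible consistency measure (the reliability statistic used in the paper is not stated in this summary).

```python
from itertools import combinations
from statistics import mean, pstdev

def run_accuracies(runs, key):
    """Fraction of items answered correctly in each run."""
    return [sum(a == k for a, k in zip(run, key)) / len(key) for run in runs]

def mean_pairwise_agreement(runs):
    """Average share of items on which two runs gave the same answer,
    taken over all pairs of runs (an assumed consistency measure)."""
    return mean(
        sum(a == b for a, b in zip(r1, r2)) / len(r1)
        for r1, r2 in combinations(runs, 2)
    )

# Hypothetical data: five multiple-choice items, three repeated runs of one model.
answer_key = ["A", "C", "B", "D", "E"]
runs = [
    ["A", "C", "B", "A", "E"],
    ["A", "B", "B", "D", "E"],
    ["A", "C", "D", "A", "E"],
]

acc = run_accuracies(runs, answer_key)
print("per-run accuracy:", acc)
print("mean accuracy:", round(mean(acc), 3), "SD:", round(pstdev(acc), 3))
print("mean pairwise agreement:", round(mean_pairwise_agreement(runs), 3))
```

Reporting accuracy and agreement separately matters because two runs can reach the same accuracy while answering different items correctly, which a single-run evaluation would not reveal.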

Abstract

Background and Objectives: Large language models (LLMs) have demonstrated high performance on knowledge-based medical examinations but their capabilities on cognitive aptitude tests emphasizing reasoning and abstraction remain underexplored. The Test for Medical Studies (TMS), a German medical school admission test, provides a standardized framework to examine these capabilities. This study aimed to evaluate the performance and consistency of multiple LLMs on text-based and visual-analytic TMS items. Materials and Methods: Eight contemporary LLMs, comprising proprietary and open-source systems, were evaluated using a multi-run design on standardized TMS items spanning text-based and visual-analytic cognitive domains. Results: Mean accuracy remained substantially below levels typically reported for knowledge-based medical examinations, with marked performance differences between…

Figures: 2

Taxonomy

Topics: Medical Education and Admissions · Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills