# Performance of Large Language Models on Cognitive Aptitude Testing: A Multi-Run Evaluation on the German Medical School Admission Test (TMS)

**Authors:** Henrik Stelling, Armin Kraus, Gerrit Grieb, Ibrahim Güler

PMC · DOI: 10.3390/ejihpe16020023 · 2026-02-12

## TL;DR

This study evaluates eight large language models on the German Test for Medical Studies (TMS), a medical school admission test that measures reasoning and abstraction, and finds that their performance is limited, domain-dependent, and inconsistent across repeated runs.

## Contribution

The study introduces a multi-run evaluation framework for assessing LLMs on cognitive aptitude tests, revealing domain-specific limitations.

## Key findings

- Mean accuracy on TMS items remained substantially below levels typically reported for knowledge-based medical examinations.
- Performance differed markedly between text-based and visual-analytic subtests.
- Open-source models performed competitively compared with proprietary systems.
- Inter-run reliability was heterogeneous, indicating notable variability across repeated evaluations.

## Abstract

**Background and Objectives:** Large language models (LLMs) have demonstrated high performance on knowledge-based medical examinations, but their capabilities on cognitive aptitude tests emphasizing reasoning and abstraction remain underexplored. The Test for Medical Studies (TMS), a German medical school admission test, provides a standardized framework to examine these capabilities. This study aimed to evaluate the performance and consistency of multiple LLMs on text-based and visual-analytic TMS items.

**Materials and Methods:** Eight contemporary LLMs, comprising proprietary and open-source systems, were evaluated using a multi-run design on standardized TMS items spanning text-based and visual-analytic cognitive domains.

**Results:** Mean accuracy remained substantially below levels typically reported for knowledge-based medical examinations, with marked performance differences between text-based and visual-analytic subtests. Open-source models performed competitively compared with proprietary systems. Inter-run reliability was heterogeneous, indicating notable variability across repeated evaluations.

**Conclusions:** Current LLMs show limited and domain-dependent performance on cognitive aptitude tasks relevant to medical school admission. High accuracy on knowledge-based examinations does not translate into stable performance on aptitude tests emphasizing fluid intelligence. The observed modality-dependent performance patterns and inter-run variability highlight the importance of differentiated, multi-run evaluation strategies when assessing LLMs for applications in medical education.
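To make the idea of a multi-run evaluation concrete, the following is a minimal sketch of how per-run accuracy and a simple consistency measure could be computed for one model. It is not the authors' pipeline: the abstract does not state which reliability statistic was used, so the "item consistency" measure here (the share of items answered identically in every run) is an assumption chosen for illustration, and the function names, answer key, and run data are all hypothetical.

```python
from statistics import mean, stdev


def run_accuracy(answers, key):
    """Fraction of items in one run that match the answer key."""
    if len(answers) != len(key):
        raise ValueError("each run must answer every item in the key")
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def summarize_runs(runs, key):
    """Summarize repeated runs of one model on the same item set.

    runs: list of runs, each a list of chosen options (one per item).
    key:  list of correct options, in the same item order.
    """
    accuracies = [run_accuracy(run, key) for run in runs]
    # Illustrative consistency measure (an assumption, not from the paper):
    # an item counts as consistent if every run chose the same option,
    # whether or not that option is correct.
    consistent_items = sum(len(set(item_answers)) == 1 for item_answers in zip(*runs))
    return {
        "per_run_accuracy": accuracies,
        "mean_accuracy": mean(accuracies),
        "sd_accuracy": stdev(accuracies) if len(accuracies) > 1 else 0.0,
        "item_consistency": consistent_items / len(key),
    }


# Hypothetical toy data: four multiple-choice items, three repeated runs.
key = ["A", "C", "B", "D"]
runs = [
    ["A", "C", "B", "A"],
    ["A", "B", "B", "D"],
    ["A", "C", "D", "D"],
]
summary = summarize_runs(runs, key)
print(summary)
```

In this toy data every run reaches the same accuracy, yet only one of the four items is answered identically across all runs. A single-run evaluation would hide that instability, which is the kind of pattern a multi-run design is meant to expose.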

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12939321/full.md

---
Source: https://tomesphere.com/paper/PMC12939321