VLM@school -- Evaluation of AI image understanding on German middle school knowledge
René Peinl, Vincent Tischler

TL;DR
This paper presents a new German middle school curriculum-based benchmark dataset to evaluate Vision Language Models' ability to combine visual reasoning with factual knowledge, revealing current models' limitations in real-world, multilingual contexts.
Contribution
Introduces a novel multilingual benchmark dataset based on real middle school curricula to assess VLMs' reasoning capabilities beyond artificial English-language tasks.
Findings
Models achieve less than 45% accuracy overall.
Performance is particularly poor in music, in mathematics, and on adversarial questions.
Significant gap between benchmark performance and real-world understanding.
Abstract
This paper introduces a novel benchmark dataset designed to evaluate the capabilities of Vision Language Models (VLMs) on tasks that combine visual reasoning with subject-specific background knowledge in the German language. In contrast to widely used English-language benchmarks that often rely on artificially difficult or decontextualized problems, this dataset draws from real middle school curricula across nine domains, including mathematics, history, biology, and religion. The benchmark includes over 2,000 open-ended questions grounded in 486 images, ensuring that models must integrate visual interpretation with factual reasoning rather than rely on superficial textual cues. We evaluate thirteen state-of-the-art open-weight VLMs across multiple dimensions, including domain-specific accuracy and performance on adversarially crafted questions. Our findings reveal that even the strongest…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
