PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models

Patrick Haller; Fabio Barth; Jonas Golde; Georg Rehm; Alan Akbik

arXiv:2510.24792·cs.CV·November 13, 2025

TL;DR

PISA-Bench is a multilingual, multimodal benchmark based on PISA tests. It is designed to evaluate vision-language models across six languages and various reasoning tasks, and it highlights the limitations of current models, especially in non-English and complex reasoning scenarios.

Contribution

This paper introduces PISA-Bench, a high-quality, multilingual, and multimodal benchmark derived from PISA assessments. It addresses two limitations of existing datasets, namely reliance on synthetically generated content and English-only coverage, and enables comprehensive evaluation of vision-language models.

Findings

01 Small models (<20B parameters) underperform on PISA-Bench.

02 Models show significant performance drops on non-English languages.

03 Models show high error rates on spatial and geometric reasoning tasks.
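The second finding compares accuracy across languages. As a minimal sketch of how such a comparison could be computed, the snippet below aggregates per-language accuracy and the gap relative to English. The function names, the language codes, and the toy records are all hypothetical illustrations; they are not the paper's evaluation code and not results from PISA-Bench.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Compute accuracy per language.

    `records` is an iterable of (language, is_correct) pairs; this record
    shape is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for language, is_correct in records:
        totals[language] += 1
        correct[language] += int(is_correct)
    return {lang: correct[lang] / totals[lang] for lang in totals}


def drop_vs_reference(accuracies, reference="en"):
    """Return reference accuracy minus each other language's accuracy."""
    ref = accuracies[reference]
    return {
        lang: ref - acc
        for lang, acc in accuracies.items()
        if lang != reference
    }


# Toy, made-up outcomes (NOT taken from the paper).
toy_records = [
    ("en", True), ("en", True), ("en", False), ("en", True),
    ("de", True), ("de", False), ("de", False), ("de", True),
]

acc = accuracy_by_language(toy_records)
gaps = drop_vs_reference(acc)
print(acc)
print(gaps)
```

A positive gap means the model scored lower in that language than in the reference language. The same grouping could be applied per question type to look at the spatial and geometric reasoning category.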

Abstract

Vision-language models (VLMs) have demonstrated remarkable progress in multimodal reasoning. However, existing benchmarks remain limited in terms of high-quality, human-verified examples. Many current datasets rely on synthetically generated content by large language models (LLMs). Furthermore, most datasets are limited to English, as manual quality assurance of translated samples is time-consuming and costly. To fill this gap, we introduce PISA-Bench, a multilingual benchmark derived from English examples of the expert-created PISA tests, a unified framework for the assessment of student competencies in over eighty countries. Each example consists of human-extracted instructions, questions, answer options, and images, enriched with question type categories, and has been translated from English into five additional languages (Spanish, German, Chinese, French, and Italian), resulting in…

Code & Models

Datasets

PisaBench/pisa-bench
dataset · 82 downloads
