MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks
Dumitran Adrian Marius, Theodor-Pierre Moroianu, Buca Mihnea-Vicentiu

TL;DR
This paper introduces MateInfoUB, a bilingual multimodal dataset for testing LLMs in complex CS educational tasks, revealing their strengths and limitations in multilingual and multimodal contexts.
Contribution
It presents a novel bilingual, multimodal dataset based on high-level CS competition questions and systematically evaluates LLMs, highlighting their performance and challenges in educational settings.
Findings
LLMs perform variably across languages and modalities.
Language choice impacts LLM reasoning capabilities.
The dataset enables assessment of LLMs in realistic CS tasks.
Abstract
The rapid advancement of Large Language Models (LLMs) has transformed various domains, particularly computer science (CS) education. These models exhibit remarkable capabilities in code-related tasks and problem-solving, raising questions about their potential and limitations in advanced CS contexts. This study presents a novel bilingual (English-Romanian) multimodal (text and image) dataset of multiple-choice questions derived from a high-level computer science competition. A distinctive feature of our dataset is that the problems are designed so that some are more easily solved by reasoning on paper, while for others writing code is more efficient. We systematically evaluate state-of-the-art LLMs on this dataset, analyzing their performance on theoretical programming tasks. Our findings reveal the strengths and limitations of current LLMs, including the influence of language choice…
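As a rough illustration of what evaluating multiple-choice answers by language and modality could look like, the sketch below groups accuracy by a chosen field. It is not the authors' evaluation code: the record fields (`language`, `modality`, `answer`, `predicted`), the helper `accuracy_by_group`, and the four toy records are all hypothetical and carry no information about the paper's actual results.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return the fraction of correct predictions for each value of `key`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        if record["predicted"] == record["answer"]:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Toy records, invented purely for illustration (not taken from the dataset).
records = [
    {"language": "en", "modality": "text", "answer": "A", "predicted": "A"},
    {"language": "en", "modality": "image", "answer": "C", "predicted": "B"},
    {"language": "ro", "modality": "text", "answer": "B", "predicted": "B"},
    {"language": "ro", "modality": "image", "answer": "D", "predicted": "D"},
]

by_language = accuracy_by_group(records, "language")
by_modality = accuracy_by_group(records, "modality")
print(by_language)
print(by_modality)
```

Grouping by a single field keeps the sketch short; a fuller analysis would likely cross language with modality and report per-model figures.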
Taxonomy
Topics: Natural Language Processing Techniques
