Evaluating Large Language Models on Computer Science University Exams in Data Structures

Edan Gabay; Yael Maoz; Jonathan Stahl; Naama Maoz; Abdo Amer; Orr Eilat; Hanoch Levy; Michal Kleinbort; Amir Rubinstein; Adi Haviv

arXiv:2604.23347·cs.CL·April 28, 2026

TL;DR

This paper evaluates several large language models on university-level Data Structures exam questions, using a new benchmark dataset built from Tel Aviv University (TAU) exams.

Contribution

The paper introduces a benchmark dataset of TAU Data Structures exam questions, covering closed and multiple-choice formats, and compares four LLMs on it: GPT-4o, Claude 3.5, Mathstral 7B, and LLaMA 3 8B.

Findings

01 GPT-4o and Claude 3.5 outperform smaller models

02 LLMs show varying accuracy on multiple-choice questions

03 The benchmark reveals current LLM limitations in CS education tasks

Abstract

We present a comprehensive evaluation of Large Language Models (LLMs) on Computer Science (CS) Data Structures examination questions. Our work introduces a new benchmark dataset comprising exam questions from Tel Aviv University (TAU), curated to assess LLMs' abilities in handling closed and multiple-choice questions. We evaluated the performance of two popular LLMs, OpenAI's GPT-4o and Anthropic's Claude 3.5, alongside two smaller LLMs, Mathstral 7B and LLaMA 3 8B, on the TAU exams benchmark. Our findings provide insight into the current capabilities of LLMs in CS education.
