Evaluating Large Language Models on Computer Science University Exams in Data Structures
Edan Gabay, Yael Maoz, Jonathan Stahl, Naama Maoz, Abdo Amer, Orr Eilat, Hanoch Levy, Michal Kleinbort, Amir Rubinstein, Adi Haviv

TL;DR
This paper evaluates several large language models on university-level Data Structures exam questions, using a new benchmark dataset built from Tel Aviv University (TAU) exams.
Contribution
It introduces a benchmark dataset of closed and multiple-choice CS exam questions from TAU and compares GPT-4o, Claude 3.5, Mathstral 7B, and LLaMA 3 8B on it.
Findings
GPT-4o and Claude 3.5 outperform the smaller models, Mathstral 7B and LLaMA 3 8B
LLMs show varying accuracy on multiple-choice questions
The benchmark reveals current LLM limitations in CS education tasks
Abstract
We present a comprehensive evaluation of Large Language Models (LLMs) on Computer Science (CS) Data Structures examination questions. Our work introduces a new benchmark dataset comprising exam questions from Tel Aviv University (TAU), curated to assess LLMs' abilities in handling closed and multiple-choice questions. We evaluated two popular LLMs, OpenAI's GPT-4o and Anthropic's Claude 3.5, alongside two smaller LLMs, Mathstral 7B and LLaMA 3 8B, on the TAU exams benchmark. Our findings provide insight into the current capabilities of LLMs in CS education.
