Evaluating Multimodal Large Language Models on Educational Textbook Question Answering

Hessa A. Alawwad; Anas Zafar; Areej Alhothali; Usman Naseem; Ali Alkhathlan; Amani Jamal

arXiv:2506.21596·cs.CL·July 16, 2025

Open Access

TL;DR

This paper evaluates multimodal large language models on educational textbook question answering, revealing significant challenges in modality integration and context handling, and introduces a benchmark for future research.

Contribution

The paper provides the first comprehensive evaluation of state-of-the-art MLLMs on textbook question answering, highlighting catastrophic context interference and differences between model architectures.

Findings

01

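Retrieved context improves LLaVA's performance on text-based questions but sharply degrades LLaMA 3.2-Vision's accuracy on diagram-based questions (validation accuracy falls from 74.07% to 25.93%).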

02

Fine-tuning enhances LLaMA 3.2-Vision's multimodal performance, but LLaVA struggles with generalization.

03

The study identifies modality prioritization and context integration as key challenges for MLLMs.

Abstract

Multimodal large language models (MLLMs) have shown success in vision-language tasks, but their ability to reason over complex educational materials remains largely untested. This work presents the first evaluation of state-of-the-art MLLMs, including LLaVA-1.5 and LLaMA 3.2-Vision, on the textbook question answering (TQA) task using the CK12-QA dataset. We introduce a multimodal retrieval-augmented generation (RAG) pipeline to simulate real-world learning by providing relevant lesson paragraphs and diagrams as context. Our zero-shot experiments reveal a critical trade-off: while retrieved context improves LLaVA's performance on text-based questions, it significantly degrades the accuracy of the more powerful LLaMA 3.2-Vision on diagram-based tasks, dropping its validation accuracy from 74.07% to 25.93%. We term this statistically significant phenomenon "catastrophic context…
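
To put the reported drop in perspective, the short sketch below computes the absolute and relative change between the two validation accuracies quoted in the abstract. The function name `accuracy_drop` is a hypothetical helper introduced here for illustration; it is not from the paper, and the calculation says nothing about statistical significance, which would require the underlying question counts.

```python
def accuracy_drop(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent).

    Hypothetical helper for illustration only; not part of the paper.
    """
    absolute = before_pct - after_pct
    relative = 100.0 * absolute / before_pct
    return absolute, relative


# Figures quoted in the abstract for LLaMA 3.2-Vision on diagram-based questions:
# 74.07% without retrieved context, 25.93% with it.
absolute, relative = accuracy_drop(74.07, 25.93)
print(f"absolute drop: {absolute:.2f} percentage points")
print(f"relative drop: {relative:.1f}% of the no-context accuracy")
```

In other words, the model loses about 48 percentage points, or roughly two thirds of its no-context accuracy, once retrieved context is added.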

Taxonomy

Topics: Educational Assessment and Pedagogy · Topic Modeling