Is your multimodal large language model a good science tutor?
Ming Liu, Liwen Wang, Wensheng Zhang

TL;DR
This paper develops a comprehensive framework to evaluate and improve multimodal large language models as science tutors, emphasizing teaching quality and educational effectiveness beyond mere accuracy.
Contribution
It introduces a rubric-based evaluation framework and a preference optimization method to enhance MLLMs' tutoring capabilities, focusing on educational alignment.
Findings
Strong problem-solving skills do not ensure high-quality tutoring.
Performance-guided optimization improves educational effectiveness.
The framework identifies both strong and weak tutors for targeted improvements.
Abstract
Multimodal large language models (MLLMs) demonstrate impressive performance on scientific reasoning tasks (e.g., ScienceQA). However, most existing benchmarks focus narrowly on the accuracy of the final answer while ignoring other metrics. In particular, when applying MLLMs to educational contexts, the goal is not only correctness but also the ability to teach. In this paper, we propose a framework that evaluates MLLMs as science tutors using a comprehensive educational rubric and a simulated student model that judges the teaching performance of the tutors. Given a list of candidate MLLM science tutors, we use rubric-based student judgments to produce a range of tutor performance scores, identifying both strong and weak tutors. Using the training section of the ScienceQA dataset, we then construct a dataset of pairwise comparisons between the outputs of strong and weak tutors. This…
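The two steps the abstract describes, ranking tutors by rubric-based scores and then pairing strong-tutor and weak-tutor outputs, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the tutor names, the numeric scores, the use of a simple mean, and the `prompt`/`chosen`/`rejected` record layout are hypothetical and are not taken from the paper, whose actual scoring and pairing procedure may differ.

```python
from statistics import mean


def rank_tutors(rubric_scores):
    """Rank tutors by mean rubric score, best first.

    rubric_scores: {tutor_name: [score per judged response, ...]}
    (hypothetical layout; the paper's rubric aggregation is not specified here)
    """
    averages = {tutor: mean(scores) for tutor, scores in rubric_scores.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


def build_preference_pairs(responses, strong_tutor, weak_tutor):
    """Pair the strong tutor's output (chosen) with the weak tutor's (rejected).

    responses: {question_id: {tutor_name: response_text}}
    Questions lacking a response from either tutor are skipped.
    """
    pairs = []
    for question_id, by_tutor in responses.items():
        if strong_tutor in by_tutor and weak_tutor in by_tutor:
            pairs.append({
                "prompt": question_id,
                "chosen": by_tutor[strong_tutor],
                "rejected": by_tutor[weak_tutor],
            })
    return pairs


# Made-up scores from a simulated student judge (illustrative only).
rubric_scores = {
    "tutor_a": [4.5, 4.0, 5.0],
    "tutor_b": [2.0, 3.0, 2.5],
    "tutor_c": [3.5, 3.0, 4.0],
}
ranking = rank_tutors(rubric_scores)
strong, weak = ranking[0][0], ranking[-1][0]

# Made-up tutor outputs keyed by question.
responses = {
    "q1": {"tutor_a": "Step-by-step explanation", "tutor_b": "The answer is B."},
    "q2": {"tutor_a": "Guided hint, then answer"},  # weak tutor missing: skipped
}
pairs = build_preference_pairs(responses, strong, weak)
```

Records of this shape are the usual input to preference optimization methods, which is why the sketch separates ranking from pair construction: the ranking decides which tutor supplies the preferred output, and the pairing only needs the two selected names.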
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training · Focus
