Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning
Xinyi Wu, Yanhao Jia, Qinglin Zhang, Yiran Qin, Luwei Xiao, Shuai Zhao

TL;DR
This paper introduces PBLBench, a new benchmark for evaluating multimodal large language models in project-based STEM education, emphasizing complex reasoning and expert validation to improve AI support for teachers.
Contribution
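The paper presents PBLBench, a comprehensive benchmark with expert-validated ground truth for evaluating MLLMs on complex, real-world educational tasks. It targets two limitations of existing benchmarks: the lack of a free-form output structure and the lack of rigorous human expert validation.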
Findings
Even the most advanced models achieve only 59% accuracy
PBLBench challenges models with domain-specific, long-context reasoning
The benchmark highlights significant gaps in current MLLM capabilities
Abstract
Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short in providing both a free-form output structure and a rigorous human expert validation process, limiting their effectiveness in evaluating real-world educational tasks. Additionally, few methods have developed automated pipelines to assist with the complex responsibilities of teachers leveraging MLLMs, largely due to model hallucination and instability, which lead to unreliable implementation. To address this gap, we introduce PBLBench, a novel benchmark designed to…
Taxonomy
Methods: ADaptive gradient method with the OPTimal convergence rate
