MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors

Baraa Hikal; Mohamed Basem; Islam Oshallah; Ali Hamdi

arXiv:2505.18549·cs.CL·May 27, 2025


TL;DR

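This paper introduces MSA-MathEval, a single instruction-tuned language model paired with a disagreement-aware ensemble inference strategy. At the BEA 2025 Shared Task on evaluating LLMs as math tutors, it ranked 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location.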

Contribution

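The paper presents a scalable, task-agnostic training pipeline that fine-tunes one model across all four tracks without task-specific architectural changes, together with a disagreement-aware inference strategy for robust multi-dimensional LLM evaluation.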

Findings

1. Achieved 1st place in Providing Guidance
2. Ranked 3rd in Actionability
3. Placed 4th in Mistake Identification and Mistake Location

Abstract

We present MSA-MathEval, our submission to the BEA 2025 Shared Task on evaluating AI tutor responses across four instructional dimensions: Mistake Identification, Mistake Location, Providing Guidance, and Actionability. Our approach uses a unified training pipeline to fine-tune a single instruction-tuned language model across all tracks, without any task-specific architectural changes. To improve prediction reliability, we introduce a disagreement-aware ensemble inference strategy that enhances coverage of minority labels. Our system achieves strong performance across all tracks, ranking 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location. These results demonstrate the effectiveness of scalable instruction tuning and disagreement-driven modeling for robust, multi-dimensional evaluation of LLMs as educational tutors.
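
The abstract does not spell out the ensemble rule, so the following is only a minimal sketch of one plausible reading of "disagreement-aware" inference: when several fine-tuned runs agree, keep their label; when they disagree, prefer the predicted label that is rarest in the training data, which raises coverage of minority labels. The function name `disagreement_aware_vote`, the `label_prior` argument, and the label strings ("Yes", "To some extent", "No") are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, Sequence


def disagreement_aware_vote(
    predictions: Sequence[str],
    label_prior: Dict[str, float],
) -> str:
    """Combine per-model labels for one tutor response (hypothetical rule).

    predictions: one predicted label per ensemble member.
    label_prior: assumed training-set frequency of each label.
    """
    if not predictions:
        raise ValueError("need at least one prediction")

    distinct = set(predictions)

    # Unanimous ensemble: nothing to resolve.
    if len(distinct) == 1:
        return predictions[0]

    # Disagreement: among the labels that were actually predicted,
    # choose the one with the lowest training frequency. Ties are
    # broken alphabetically so the result is deterministic.
    return min(distinct, key=lambda label: (label_prior.get(label, 0.0), label))


# Illustrative priors only; real frequencies would come from the training set.
example_prior = {"Yes": 0.7, "To some extent": 0.2, "No": 0.1}
unanimous = disagreement_aware_vote(["Yes", "Yes", "Yes"], example_prior)
contested = disagreement_aware_vote(["Yes", "Yes", "No"], example_prior)
```

Under this assumed rule, a single dissenting vote for a rare label overrides the majority, which trades some precision on the dominant label for better recall on the minority ones. A plain majority vote would be the natural baseline to compare against.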
