Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics
Jatin Bhusal, Nancy Mahatha, Aayush Acharya, and Raunak Regmi

TL;DR
This study evaluates multiple large language models for automated secondary mathematics assessment within a human-in-the-loop framework, highlighting architecture-compatibility issues and the potential for assistive support.
Contribution
It introduces a benchmarking framework assessing LLMs' effectiveness in competency-based education, emphasizing architecture compatibility over model size.
Findings
Gemini-based models achieved fair agreement with human assessment.
The larger Orion model (Llama 3.3-70B) showed no agreement with human assessment, indicating that architecture, not just model size, affects performance.
LLMs are more suitable for assistive support than autonomous certification.
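The summary above does not name the agreement statistic behind "fair agreement" and "no agreement". Those labels resemble the bands commonly attached to Cohen's kappa (the Landis and Koch scale), so the sketch below assumes that statistic purely for illustration. The function names and any scores passed to them are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    Assumption: one label per student response from each rater, for example
    a human rubric level and the level assigned by a model.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items on which both raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


def landis_koch_band(kappa):
    """Map a kappa value to the Landis and Koch descriptive band."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]
    for upper, name in bands:
        if kappa <= upper:
            return name
    return "almost perfect"
```

Calling `cohen_kappa(human_levels, model_levels)` for each model and competency, then passing the result to `landis_koch_band`, would give one descriptive label per model, which is one plausible way to arrive at statements like those in the findings.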
Abstract
As Competency-Based Education (CBE) gains traction around the world, the shift from marks-based assessment to qualitative competency mapping remains a largely manual challenge for educators. This paper addresses that bottleneck by proposing a "Human-in-the-Loop" benchmarking framework to assess the effectiveness of multiple LLMs in automating secondary-level mathematics assessment. Based on the Grade 10 Optional Mathematics curriculum in Nepal, we developed a multi-dimensional rubric covering four topics and four cross-cutting competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation. The multi-provider ensemble, consisting of the open-weight models Eagle (Llama 3.1-8B) and Orion (Llama 3.3-70B) and the proprietary frontier models Nova (Gemini 2.5 Flash) and Lyra (Gemini 3 Pro), was benchmarked against a ground truth defined by two senior mathematics faculty members…
