Small, Private Language Models as Teammates for Educational Assessment Design
Chris Davis Jaldi, Anmol Saini, Shan Zhang, Noah Schroeder, Cogan Shimizu, Eleni Ilkou

TL;DR
This study compares small language models (SLMs) and large language models (LLMs) for designing educational assessment questions. It highlights SLMs' privacy advantages and competitive quality, and notes that model-based evaluation diverges systematically from expert judgment.
Contribution
It systematically evaluates SLMs versus LLMs for assessment tasks, focusing on quality, reliability, and deployment considerations in educational settings.
Findings
SLMs perform competitively across pedagogical quality metrics.
Model-based evaluations show systematic biases relative to expert judgments.
SLMs enable privacy-sensitive, local deployment for assessment design.
Abstract
Generative AI increasingly supports educational design tasks, e.g., through Large Language Models (LLMs), which have demonstrated the capability to design assessment questions aligned with pedagogical frameworks (e.g., Bloom's taxonomy). However, existing studies often rely on subjective or limited evaluation methods, focus primarily on proprietary models, or rarely examine generation, evaluation, or deployment constraints systematically in real educational settings. Meanwhile, Small Language Models (SLMs) have emerged as local alternatives that better address privacy and resource limitations, yet their effectiveness for assessment tasks remains underexplored. To address this gap, we systematically compare LLMs and SLMs for assessment question design; evaluate generation quality across Bloom's taxonomy levels using reproducible, pedagogically grounded metrics; and further assess model-based judging…
