Scaling Item-to-Standard Alignment with Large Language Models: Accuracy, Limits, and Solutions
Farzan Karimi-Malekabadi, Pooya Razavi, Sonya Powers

TL;DR
This study evaluates how well large language models automate educational item-to-standard alignment, reporting 83-94% accuracy in detecting alignment status and proposing hybrid approaches that reduce manual review effort.
Contribution
The paper introduces a novel application of LLMs to educational content alignment, showing their potential to improve review efficiency while maintaining accuracy.
Findings
In Study 1, GPT-4o-mini correctly identified alignment status in approximately 83-94% of cases, including subtle misalignments.
Pre-filtering candidate standards improves skill suggestion accuracy to over 95%.
Performance varies between subjects, with lower accuracy in reading.
Abstract
As educational systems evolve, ensuring that assessment items remain aligned with content standards is essential for maintaining fairness and instructional relevance. Traditional human alignment reviews are accurate but slow and labor-intensive, especially across large item banks. This study examines whether Large Language Models (LLMs) can accelerate this process without sacrificing accuracy. Using over 12,000 item-skill pairs in grades K-5, we tested three LLMs (GPT-3.5 Turbo, GPT-4o-mini, and GPT-4o) across three tasks that mirror real-world challenges: identifying misaligned items, selecting the correct skill from the full set of standards, and narrowing candidate lists prior to classification. In Study 1, GPT-4o-mini correctly identified alignment status in approximately 83-94% of cases, including subtle misalignments. In Study 2, performance remained strong in mathematics but was…
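The candidate-narrowing task mentioned in the abstract can be illustrated with a small sketch. The listing does not say how the authors narrow the candidate list, so the lexical-overlap scoring below is a stand-in assumption, and every name in it (`prefilter_standards`, `overlap_score`, the standard IDs and their wording) is hypothetical. The idea shown is only the general shape of the step: rank all standards against an item, keep the top few, and hand that shortlist to the classifier instead of the full set.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(item_text, standard_text):
    """Jaccard overlap between the token sets of an item and a standard.

    Assumption: a simple lexical measure used for illustration only;
    it is not the method reported in the paper.
    """
    a, b = set(tokenize(item_text)), set(tokenize(standard_text))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def prefilter_standards(item_text, standards, k=3):
    """Return the IDs of the k standards that best match the item text."""
    ranked = sorted(
        standards.items(),
        key=lambda pair: overlap_score(item_text, pair[1]),
        reverse=True,
    )
    return [standard_id for standard_id, _ in ranked[:k]]


# Hypothetical standards and item, invented for this example.
STANDARDS = {
    "MATH.ADD": "Add two whole numbers within 20",
    "MATH.FRAC": "Compare two fractions with the same denominator",
    "READ.MAIN": "Identify the main idea of a short text",
}

item = "Which fraction is larger: 3/8 or 5/8? Compare the two fractions."
shortlist = prefilter_standards(item, STANDARDS, k=2)
print(shortlist)
```

In a hybrid workflow of the kind the summary describes, the shortlist would then be passed to an LLM for the final skill selection, with low-confidence cases routed to human reviewers. That downstream step is omitted here because the listing gives no detail about prompts or thresholds.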
