An Evaluation of the Pedagogical Soundness and Usability of AI-Generated Lesson Plans Across Different Models and Prompt Frameworks in High-School Physics
Xincheng Liu

TL;DR
This study systematically evaluates AI-generated high-school physics lesson plans across different models and prompt frameworks, using automated metrics to assess pedagogical soundness, usability, and alignment with standards.
Contribution
It introduces a comprehensive evaluation framework for AI-generated lesson plans, comparing multiple models and prompt structures to identify optimal configurations for educational use.
Findings
DeepSeek produced the most readable lesson plans.
The RACE prompt framework yielded the lowest hallucination index.
Combining readability-optimized models with RACE and explicit checklists enhances lesson plan quality.
Abstract
This study evaluates the pedagogical soundness and usability of AI-generated lesson plans across five leading large language models: ChatGPT (GPT-5), Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2, and Grok 4. Beyond model choice, three structured prompt frameworks were tested: TAG (Task, Audience, Goal), RACE (Role, Audience, Context, Execution), and COSTAR (Context, Objective, Style, Tone, Audience, Response Format). Fifteen lesson plans were generated for a single high-school physics topic, The Electromagnetic Spectrum. The lesson plans were analyzed through four automated computational metrics: (1) readability and linguistic complexity, (2) factual accuracy and hallucination detection, (3) standards and curriculum alignment, and (4) cognitive demand of learning objectives. Results indicate that model selection exerted the strongest influence on linguistic accessibility, with…
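The experimental design described in the abstract (five models crossed with three prompt frameworks, giving fifteen lesson plans) can be sketched as a simple condition grid. This is a minimal illustration, not the author's pipeline: the model and framework names come from the abstract, while `build_prompt`, the `values` dictionary, and the example field contents are hypothetical.

```python
from itertools import product

# Model and framework names as listed in the abstract.
MODELS = [
    "ChatGPT (GPT-5)",
    "Claude Sonnet 4.5",
    "Gemini 2.5 Flash",
    "DeepSeek V3.2",
    "Grok 4",
]
FRAMEWORKS = {
    "TAG": ["Task", "Audience", "Goal"],
    "RACE": ["Role", "Audience", "Context", "Execution"],
    "COSTAR": ["Context", "Objective", "Style", "Tone", "Audience", "Response Format"],
}
TOPIC = "The Electromagnetic Spectrum"


def build_prompt(framework, values):
    """Hypothetical helper: lay out one labeled line per framework field.

    Fields missing from `values` are left blank rather than guessed.
    """
    fields = FRAMEWORKS[framework]
    return "\n".join(f"{field}: {values.get(field, '')}" for field in fields)


# One condition per (model, framework) pair: 5 models x 3 frameworks.
conditions = list(product(MODELS, FRAMEWORKS))

if __name__ == "__main__":
    # Illustrative field contents only; the study's actual prompts are not given here.
    example = build_prompt(
        "TAG",
        {
            "Task": f"Write a lesson plan on {TOPIC}",
            "Audience": "High-school physics students",
            "Goal": "A classroom-ready plan aligned with curriculum standards",
        },
    )
    print(len(conditions), "conditions")
    print(example)
```

Keeping the framework fields in a single dictionary means each model receives the same prompt for a given framework, so differences in the output can be attributed to the model or the framework rather than to wording drift.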
