OpenLearnLM Benchmark: A Unified Framework for Evaluating Knowledge, Skill, and Attitude in Educational Large Language Models

Unggi Lee; Sookbun Lee; Heungsoo Choi; Jinseo Lee; Haeun Park; Younghoon Jeon; Sungmin Cho; Minju Kang; Junbo Koh; Jiyeong Bae; Minwoo Nam; Juyeon Eun; Yeonji Jung; Yeil Jeong

arXiv:2601.13882·cs.CL·January 21, 2026

Open Access

TL;DR

OpenLearnLM Benchmark introduces a comprehensive, theory-grounded framework for evaluating educational large language models across knowledge, skills, and attitude dimensions, addressing limitations of existing narrow benchmarks.

Contribution

The paper presents a unified, multi-dimensional evaluation framework grounded in the learning sciences, with a large, diverse dataset and novel assessment methods for educational LLMs.

Findings

01. Grok-4.1-fast excels in knowledge but has alignment issues.

02. Claude-Opus-4.5 performs well in practical skills.

03. No single model dominates all evaluation dimensions.

Abstract

Large Language Models are increasingly deployed as educational tools, yet existing benchmarks focus on narrow skills and lack grounding in learning sciences. We introduce OpenLearnLM Benchmark, a theory-grounded framework evaluating LLMs across three dimensions derived from educational assessment theory: Knowledge (curriculum-aligned content and pedagogical understanding), Skills (scenario-based competencies organized through a four-level center-role-scenario-subscenario hierarchy), and Attitude (alignment consistency and deception resistance). Our benchmark comprises 124K+ items spanning multiple subjects, educational roles, and difficulty levels based on Bloom's taxonomy. The Knowledge domain prioritizes authentic assessment items from established benchmarks, while the Attitude domain adapts Anthropic's Alignment Faking methodology to detect behavioral inconsistency under varying…

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Text Readability and Simplification · Educational Assessment and Pedagogy