Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning

Navan Preet Singh; Xiaokun Wang; Anurag Garikipati; Madalina Ciobanu; Qingqing Mao; Ritankar Das

arXiv:2604.06385 · cs.CL · April 9, 2026

TL;DR

This paper introduces a multi-stage approach that combines reinforcement learning and supervised fine-tuning to substantially improve the pedagogical knowledge of open-source LLMs, achieving state-of-the-art results on the Cross-Domain Pedagogical Knowledge Benchmark.

Contribution

It presents a novel application-driven optimization pipeline combining RL and SFT, transforming mid-sized open-source LLMs into domain-specific pedagogical experts.

Findings

01. Achieved a new state of the art (SOTA) on the Cross-Domain Pedagogical Knowledge Benchmark.

02. The resulting models outperform larger proprietary systems such as Gemini-3 Pro.

03. Demonstrated that domain-specific optimization enhances model expertise.

Abstract

We present an innovative multi-stage optimization strategy that combines reinforcement learning (RL) and supervised fine-tuning (SFT) to enhance the pedagogical knowledge of large language models (LLMs), as illustrated by EduQwen 32B-RL1, EduQwen 32B-SFT, and an optional third-stage model, EduQwen 32B-SFT-RL2. The strategy consists of: (1) RL optimization that implements progressive difficulty training, focuses on challenging examples, and employs extended reasoning rollouts; (2) a subsequent SFT phase that leverages the RL-trained model to synthesize high-quality training data with difficulty-weighted sampling; and (3) an optional second round of RL optimization. EduQwen 32B-RL1, EduQwen 32B-SFT, and EduQwen 32B-SFT-RL2 are an application-driven family of open-source pedagogical LLMs built on a dense Qwen3-32B backbone. Remarkably, these models achieve high enough accuracy on the Cross-Domain Pedagogical Knowledge…
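The abstract names three data-handling ideas: progressive difficulty training, focusing on challenging examples, and difficulty-weighted sampling. It does not say how any of them is implemented. The Python sketch below shows one common way to realize them, and it rests on assumptions that are not taken from the paper. First, a question's difficulty is estimated as one minus the empirical pass rate of the model's rollouts on it. Second, the RL curriculum is split into buckets of increasing difficulty. Third, questions the model already solves on every rollout are dropped. Fourth, SFT examples are drawn with probability proportional to difficulty plus a small floor. The names `Item`, `estimate_difficulty`, `curriculum_stages`, `filter_challenging`, and `difficulty_weighted_sample`, and the pass-rate definition of difficulty, are all hypothetical illustrations. They are not the authors' implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical training question with recorded rollout outcomes."""
    qid: str
    rollout_correct: list[bool]  # one entry per sampled rollout (assumption)


def estimate_difficulty(item: Item) -> float:
    """Assumption: difficulty = 1 - empirical pass rate over rollouts."""
    if not item.rollout_correct:
        return 1.0  # no evidence yet: treat the question as hardest
    pass_rate = sum(item.rollout_correct) / len(item.rollout_correct)
    return 1.0 - pass_rate


def curriculum_stages(items: list[Item], n_stages: int) -> list[list[Item]]:
    """Sort by difficulty and split into at most n_stages buckets, easiest
    first; a simple stand-in for 'progressive difficulty training'."""
    if not items or n_stages < 1:
        return []
    ordered = sorted(items, key=estimate_difficulty)
    size = -(-len(ordered) // n_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def filter_challenging(items: list[Item], min_difficulty: float = 0.0) -> list[Item]:
    """Keep items whose difficulty exceeds min_difficulty; with the default
    0.0 this drops questions that every rollout already answers correctly."""
    return [it for it in items if estimate_difficulty(it) > min_difficulty]


def difficulty_weighted_sample(items: list[Item], k: int,
                               floor: float = 0.05, seed: int = 0) -> list[Item]:
    """Draw k items with replacement using weight = difficulty + floor, so
    harder items are sampled more often while easy ones are not excluded."""
    rng = random.Random(seed)
    weights = [estimate_difficulty(it) + floor for it in items]
    return rng.choices(items, weights=weights, k=k)


# Usage on a toy pool (difficulties: 0.0, 0.5, 0.75, 1.0).
pool = [
    Item("easy", [True, True, True, True]),
    Item("medium", [True, False, True, False]),
    Item("hard", [False, False, False, True]),
    Item("unsolved", [False, False, False, False]),
]
print([[it.qid for it in s] for s in curriculum_stages(pool, n_stages=2)])
# [['easy', 'medium'], ['hard', 'unsolved']]
```

The floor in `difficulty_weighted_sample` is a design choice. It keeps easy questions in the synthesized SFT mix at a low rate instead of removing them entirely. `filter_challenging` does the opposite and removes already-solved questions outright, which suits the RL stage, where such questions may add little training signal.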
