Developing a Tutoring Dialog Dataset to Optimize LLMs for Educational Use
Menna Fateen, Tsunenori Mine

TL;DR
This paper presents a cost-effective approach to developing educational tutoring systems by creating a synthetic dialog dataset to fine-tune smaller LLMs, achieving comparable performance to larger models in real-world scenarios.
Contribution
The study introduces a synthetic tutoring dialog dataset and demonstrates that fine-tuning smaller LLMs can match larger models' performance at lower costs.
Findings
Fine-tuned smaller LLMs perform on par with larger models in tutoring tasks.
Synthetic datasets can effectively train LLMs for educational applications.
Cost reduction achieved without sacrificing model effectiveness.
Abstract
Recent advances in large language models (LLMs) have shown promise for scalable educational applications, but their use in dialog-based tutoring systems remains challenging due to the need for effective pedagogical strategies and the high costs associated with expert-curated datasets. Our study explores the use of smaller, more affordable LLMs for one-on-one tutoring in the context of solving reading comprehension problems. We developed a synthetic tutoring dialog dataset, evaluated by human teachers, and fine-tuned a smaller LLM using this dataset. Furthermore, we conducted an interactive experiment comparing the performance of the fine-tuned model with a larger model in real-world tutoring scenarios. Our results show that the fine-tuned model performs on par with the larger model but at a lower cost, demonstrating a viable, cost-effective approach for implementing LLM-based tutoring…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSemantic Web and Ontologies · Speech and dialogue systems · Topic Modeling
