Chat-Based Support Alone May Not Be Enough: Comparing Conversational and Embedded LLM Feedback for Mathematical Proof Learning

Eason Chen; Sophia Judicke; Kayla Beigh; Xinyi Tang; Isabel Wang; Nina Yuan; Zimo Xiao; Chuangji Li; Shizhuo Li; Reed Luttmer; Shreya Singh; Maria Yampolsky; Naman Parikh; Yvonne Zhao; Meiyi Chen; Scarlett Huang; Anishka Mohanty; Gregory Johnson; John Mackey; Jionghao Lin; and Ken Koedinger

arXiv:2602.18807 · cs.HC · April 2, 2026

TL;DR

This study evaluates GPTutor, an LLM-based math tutoring system, and compares its chatbot with its embedded proof-review tool. The results suggest that chatbot support alone may not improve independent proof-learning outcomes.

Contribution

The paper introduces and empirically compares two LLM-supported tutoring components, a chatbot and an embedded proof-review tool, and highlights how differently each relates to student learning in mathematics.

Findings

01. Higher chatbot usage was linked to lower midterm scores.

02. Proof-review tool usage showed no negative association with performance.

03. Students with lower self-efficacy used both tools more frequently.

Abstract

We evaluate GPTutor, an LLM-powered tutoring system for an undergraduate discrete mathematics course. It integrates two LLM-supported tools: a structured proof-review tool that provides embedded feedback on students' written proof attempts, and a chatbot for math questions. In a staggered-access study with 148 students, earlier access was associated with higher homework performance during the interval when only the experimental group could use the system, but we did not observe this gain transferring to exam scores. Usage logs show that students with lower self-efficacy and lower prior exam performance used both components more frequently. Session-level behavioral labels, produced by human coding and scaled using an automated classifier, characterize how students engaged with the chatbot (e.g., answer-seeking or help-seeking). In models controlling for prior performance and…
