We Need Knowledge Distillation for Solving Math Word Problems
Zhenquan Shen, Xinguo Yu, Xiaotian Cheng, Rao Peng, Hao Ming

TL;DR
This paper demonstrates that knowledge distillation can effectively compress large language models for math word problem solving, maintaining high accuracy while significantly reducing computational costs, thus benefiting educational applications.
Contribution
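The paper introduces a method for compressing LLMs for math word problems (MWPs) by distilling their embedding vectors into a smaller student model while preserving performance and generalizability. It also identifies part-of-speech information as a key linguistic feature influencing compressibility.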
Findings
The student model retains nearly 90% of the teacher's performance with only 1/12 of its parameters
The distilled model is task-agnostic: the compressed vectors perform well across MWP-related tasks
Part-of-speech information is crucial for MWP compressibility
Abstract
The enhancement of mathematical capabilities in large language models (LLMs) fosters new developments in mathematics education within primary and secondary schools, particularly as they relate to intelligent tutoring systems. However, LLMs require substantial computational resources, resulting in significant costs in educational contexts. To mitigate this drawback, this paper investigates the feasibility of compressing LLMs for solving math word problems (MWPs). We compress the embedded vectors encoded by BERT and distill a considerably smaller student model. Our findings indicate that the student model can maintain nearly 90% of the performance of the teacher model while utilizing only 1/12 of its parameters. In addition to achieving high accuracy, the model exhibits strong generalizability, as the compressed vectors perform well across all tasks related to MWPs, and the distillation…
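The idea of compressing teacher vectors into a smaller student representation can be illustrated with a toy sketch. This is not the authors' method: random vectors stand in for BERT embeddings, and the function names, dimensions, linear map, and learning rate are all assumptions made for illustration. The student learns a low-dimensional vector per input, together with a linear projection back to the teacher space, by gradient descent on the mean squared error.

```python
import random


def project(W, s):
    """Map a student vector s back to the teacher space with matrix W."""
    return [sum(w * x for w, x in zip(row, s)) for row in W]


def distill_vectors(teacher, student_dim, steps=500, lr=0.2, seed=0):
    """Toy vector distillation (hypothetical helper, not from the paper).

    Learns one student vector per teacher vector plus a shared linear map,
    minimizing the mean squared error between projected student vectors
    and teacher vectors.
    """
    rng = random.Random(seed)
    n, D = len(teacher), len(teacher[0])
    S = [[rng.uniform(-0.5, 0.5) for _ in range(student_dim)] for _ in range(n)]
    W = [[rng.uniform(-0.5, 0.5) for _ in range(student_dim)] for _ in range(D)]
    scale = 2.0 / (n * D)  # derivative factor of the mean squared error
    losses = []
    for _ in range(steps):
        grad_W = [[0.0] * student_dim for _ in range(D)]
        grad_S = [[0.0] * student_dim for _ in range(n)]
        total = 0.0
        for i in range(n):
            err = [p - t for p, t in zip(project(W, S[i]), teacher[i])]
            total += sum(e * e for e in err)
            for j in range(D):
                for k in range(student_dim):
                    grad_W[j][k] += scale * err[j] * S[i][k]
                    grad_S[i][k] += scale * err[j] * W[j][k]
        losses.append(total / (n * D))  # loss before this step's update
        for j in range(D):
            for k in range(student_dim):
                W[j][k] -= lr * grad_W[j][k]
        for i in range(n):
            for k in range(student_dim):
                S[i][k] -= lr * grad_S[i][k]
    return S, W, losses


# Stand-in "teacher" embeddings: 6 random 8-dimensional vectors (assumption).
data_rng = random.Random(42)
teacher = [[data_rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(6)]

S, W, losses = distill_vectors(teacher, student_dim=2)
print("initial loss:", losses[0])
print("final loss:  ", losses[-1])
```

In a realistic setting the teacher vectors would come from a pretrained encoder and the student would be a smaller network rather than a lookup table of vectors; the sketch only shows the shape of the objective.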
