A mixed policy to improve performance of language models on math problems
Gang Chen

TL;DR
This paper introduces a mixed policy reinforcement learning approach with a two-level token exploration strategy to enhance the accuracy of language models on math problems, demonstrating over 2% performance improvement on GSM8K.
Contribution
It proposes a novel two-level token exploration policy combining probabilistic and deterministic methods for math problem solving in language models.
Findings
Achieved a performance gain of over 2% on the GSM8K dataset with GPT-2.
Demonstrated effectiveness of mixed policy exploration in math reasoning.
Implemented a two-level token exploration strategy for improved accuracy.
Abstract
When solving math problems, most language models use a sampling strategy to predict the next word according to conditional probabilities. During a math reasoning step, this sampling may generate a wrong answer. Because math problems are deterministic, we propose a mixed policy exploration approach to solve math problems with reinforcement learning. In particular, we propose a two-level token exploration policy: the abstract level explores the next token with probability, and the second level is deterministic. Specifically, the abstract-level policy decides whether the token is an operator or an operand by probability sampling, while the second level deterministically selects the next token with the highest score in a greedy way. We test our method on the GSM8K dataset with the GPT-2 model and demonstrate a performance gain of more than 2%. Our implementation is available at https://github.com/vividitytech/math_lm_rl.
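The two-level policy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (see the linked repository for that). The operator set, the `category_of` helper, and the choice to obtain the abstract-level probability by summing softmax mass per category are all assumptions made for illustration; the abstract does not specify how the abstract-level distribution is computed.

```python
import math
import random

# Hypothetical operator vocabulary; the real tokenizer and operator set may differ.
OPERATORS = {"+", "-", "*", "/", "="}


def category_of(token):
    """Assumed helper: classify a token as 'operator' or 'operand'."""
    return "operator" if token in OPERATORS else "operand"


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def two_level_step(logits, rng):
    """Pick the next token from a dict mapping token -> score.

    Level 1 (abstract, stochastic): sample the category, operator or operand.
    Level 2 (deterministic): take the highest-scoring token in that category.
    """
    tokens = list(logits)
    probs = softmax([logits[t] for t in tokens])

    # Assumption: the category probability is the total token probability
    # mass that falls in that category.
    operator_mass = sum(p for t, p in zip(tokens, probs)
                        if category_of(t) == "operator")

    # Level 1: probabilistic choice between the two categories.
    category = "operator" if rng.random() < operator_mass else "operand"

    # Level 2: greedy choice inside the sampled category.
    candidates = [t for t in tokens if category_of(t) == category]
    if not candidates:
        # Fall back to plain greedy decoding if the category is empty.
        candidates = tokens
        category = category_of(max(candidates, key=logits.get))
    token = max(candidates, key=logits.get)
    return category, token


if __name__ == "__main__":
    example_logits = {"3": 2.0, "5": 1.0, "+": 0.5, "*": 0.1}
    print(two_level_step(example_logits, random.Random(0)))
```

Under these assumptions, randomness enters only through the operator-versus-operand decision. Once the category is fixed, the token choice is deterministic, so with the example scores the step can only ever return `"3"` or `"+"`.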
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Weight Decay · Discriminative Fine-Tuning · Residual Connection · Adam
