TL;DR
GFPO is a training method for large language models that reduces verbosity by filtering responses based on length and token efficiency, leading to more concise reasoning without sacrificing accuracy.
Contribution
The paper introduces GFPO, a novel training approach that filters responses during training to promote concise reasoning, and proposes Adaptive Difficulty GFPO for better resource allocation.
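The resource-allocation idea behind Adaptive Difficulty GFPO can be illustrated with a small sketch. It assumes, as the reviews on this page describe, that difficulty is estimated from the average reward of the sampled responses, and that harder problems keep more responses for training. The bucket thresholds, the retained counts, and the name `adaptive_k` are hypothetical placeholders, not values from the paper.

```python
from statistics import mean

# HYPOTHETICAL buckets: (minimum mean reward, responses to retain).
# These numbers are illustrative only and are not taken from the paper.
HYPOTHETICAL_BUCKETS = [(0.75, 2), (0.50, 4), (0.25, 6), (0.00, 8)]


def adaptive_k(rewards, buckets=HYPOTHETICAL_BUCKETS):
    """Choose how many sampled responses to retain for one problem.

    A high mean reward suggests an easy problem, so fewer responses are
    kept (more aggressive filtering). A low mean reward suggests a hard
    problem, so more responses are kept.
    """
    avg = mean(rewards)
    for min_avg, k in buckets:
        if avg >= min_avg:
            return min(k, len(rewards))
    # Mean reward below every threshold: fall back to the last bucket.
    return min(buckets[-1][1], len(rewards))
```

Because the estimate comes from a finite sample of rollouts, a real implementation would likely need some smoothing across training steps; that is left out here.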
Findings
On Phi-4-reasoning, GFPO cuts GRPO's response length inflation by 46-71% while maintaining accuracy.
Optimizing for reward per token pushes the reduction in length inflation further, to 71-85%.
Spending more compute at training time (sampling larger groups) lowers test-time compute, because the trained model produces shorter responses.
Abstract
Large language models trained with reinforcement learning with verifiable rewards tend to trade accuracy for length--inflating response lengths to achieve gains in accuracy. While longer answers may be warranted for harder problems, many tokens are merely "filler": repetitive, verbose text that makes no real progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem during training and filtering responses to train on based on two key metrics: (1) response length and (2) token efficiency: reward per token ratio. By sampling more at training time, we teach models to think less at inference time. On the Phi-4-reasoning model, GFPO cuts GRPO's length inflation by 46-71% across challenging STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while maintaining accuracy. Optimizing for reward per…
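The filtering step described in the abstract can be sketched roughly as follows: sample a group of responses, rank them by one of the two metrics, and keep only the top `k` for training. This is a minimal sketch, not the authors' implementation. The function names, the `eps` constant, and the choice to normalize advantages over the retained subset and give zero advantage to rejected responses are assumptions modeled on GRPO-style group normalization.

```python
from statistics import mean, pstdev


def gfpo_select(rewards, lengths, k, metric="length"):
    """Return the sorted indices of the k responses retained for training.

    metric="length":           keep the k shortest responses.
    metric="token_efficiency": keep the k responses with the highest
                               reward-per-token ratio.
    """
    if len(rewards) != len(lengths):
        raise ValueError("rewards and lengths must have the same size")
    if not 1 <= k <= len(rewards):
        raise ValueError("k must be between 1 and the group size")
    if metric == "length":
        key = lambda i: lengths[i]                # shortest first
    elif metric == "token_efficiency":
        key = lambda i: -rewards[i] / lengths[i]  # most efficient first
    else:
        raise ValueError(f"unknown metric: {metric}")
    return sorted(sorted(range(len(rewards)), key=key)[:k])


def masked_advantages(rewards, kept, eps=1e-6):
    """ASSUMPTION: normalize rewards over the retained subset only and
    assign zero advantage to every filtered-out response."""
    kept_rewards = [rewards[i] for i in kept]
    mu, sigma = mean(kept_rewards), pstdev(kept_rewards)
    kept_set = set(kept)
    return [
        (r - mu) / (sigma + eps) if i in kept_set else 0.0
        for i, r in enumerate(rewards)
    ]


# Toy group of G = 8 sampled responses with binary verifiable rewards.
rewards = [1, 0, 1, 1, 0, 1, 0, 1]
lengths = [900, 300, 400, 1200, 250, 800, 700, 500]

print(gfpo_select(rewards, lengths, k=4, metric="length"))            # [1, 2, 4, 7]
print(gfpo_select(rewards, lengths, k=4, metric="token_efficiency"))  # [0, 2, 5, 7]
```

In this toy group, filtering by length alone retains two short but incorrect responses, while filtering by token efficiency retains only correct ones.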
Peer Reviews
Decision·ICLR 2026 Poster
- Simple, elegant, but novel (I think) idea
- High-validity experiment by testing against an existing baseline with everything else held equal
- Clear demonstration of improvement in performance
- Lengthy and detailed analysis
I am pretty convinced by the results, but it would be nice to see the experiment replicated for at least one other model so we can know it’s not a fluke.
1. The intuition of sampling more and using more computational resources on hard questions is straightforward and reasonable.
2. The experiment shows the effectiveness of GFPO in different tasks, including mathematical, STEM, and coding reasoning.
3. The method itself is simple and easy to plug into any other RL post-training frameworks.
1. The evaluated model size and model family are limited: only the 14B Phi-4 model is covered.
2. Considering this method is based on GRPO, maybe an analysis of training stability is needed.
1. GFPO’s core idea of in-training filtering is clear and easy to implement. By sampling a larger pool of responses and selectively training on the best subset, it avoids the complexities of explicit reward engineering. Framing this as implicit reward shaping provides an intuitive and generalizable way to steer model behavior toward desirable attributes like conciseness.
2. The paper demonstrates a rare and highly desirable outcome: improving inference efficiency while maintaining or even enhanc
1. The Adaptive Difficulty GFPO relies on the average reward of sampled responses to estimate problem difficulty. While conceptually clever, this heuristic can be unstable and noisy, especially early in training or on high-variance problems. Randomly poor rollouts may cause easy instances to be misclassified as “hard,” leading to inefficient resource allocation.
2. The method’s success hinges on the retention fraction k/G, which determines how aggressively responses are filtered. Although the p
