Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization
Xiyan Fu, Wei Liu

TL;DR
This paper explores using outcome-level reinforcement learning to enhance compositional generalization in models, showing improvements over traditional supervised fine-tuning on various benchmarks.
Contribution
It introduces a reinforcement learning approach with outcome-based rewards to better capture global compositional structure for generalization.
Findings
Reinforcement learning outperforms supervised fine-tuning on compositional benchmarks.
RL reshapes the model's output distribution, which helps it generalize to complex compositions.
Supervised models tend to overfit frequent compositions, unlike RL models.
Abstract
Compositional generalization, the ability to correctly interpret novel combinations of known primitives, remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target outputs. This token-level training paradigm fails to capture the global compositional structure required for generalizing to unseen combinations. In this work, we investigate whether compositional generalization can instead be improved through outcome-level reinforcement learning. We adopt Group Relative Policy Optimization to optimize models based on feedback on their final outputs. Within this framework, we explore both a simple binary outcome reward and a composite reward that provides additional composition feedback. Experiments on multiple compositional benchmarks show that reinforcement learning improves compositional generalization compared to…
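To make the outcome-level idea concrete, the following is a minimal sketch of how a group of sampled outputs for one input might be scored and turned into group-relative advantages, as in Group Relative Policy Optimization. It is not the authors' implementation. The function names, the exact-match criterion, the token-overlap term in `composite_reward`, its `weight`, and the use of the population standard deviation are all assumptions made for illustration; the abstract does not specify how the composite reward is defined.

```python
from statistics import mean, pstdev


def binary_reward(prediction: str, target: str) -> float:
    """Outcome-level reward: 1.0 only if the whole output matches the target."""
    return 1.0 if prediction.strip() == target.strip() else 0.0


def composite_reward(prediction: str, target: str, weight: float = 0.5) -> float:
    """Hypothetical composite reward (an assumption, not the paper's definition):
    exact-match reward plus partial credit for the fraction of distinct
    target tokens that also appear in the prediction."""
    pred_tokens = set(prediction.split())
    target_tokens = set(target.split())
    overlap = len(pred_tokens & target_tokens) / len(target_tokens) if target_tokens else 0.0
    return binary_reward(prediction, target) + weight * overlap


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the other samples for the same input:
    subtract the group mean and divide by the group standard deviation.
    Population std is assumed here; eps avoids division by zero."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled outputs for one input whose target is "JUMP JUMP".
target = "JUMP JUMP"
samples = ["JUMP JUMP", "JUMP", "WALK WALK", "JUMP JUMP"]

rewards = [binary_reward(s, target) for s in samples]
advantages = group_relative_advantages(rewards)
```

In this sketch every token of a sampled output shares the same advantage, so the learning signal depends on whether the final output is correct rather than on imitating the target token by token. When all samples in a group receive the same reward, the advantages are all zero and that group contributes no gradient.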
