TL;DR
This paper introduces RLMR, a reinforcement learning approach that dynamically balances subjective writing quality and objective constraints, significantly improving creative writing performance in large language models.
Contribution
The authors present RLMR as the first method to combine subjective preference scoring with objective constraint verification in online RL training for creative writing.
Findings
Improved instruction following from 83.36% to 86.65%.
Achieved a 72.75% win rate in manual evaluations.
Demonstrated effectiveness across models from 8B to 72B parameters.
Abstract
Large language models are extensively utilized in creative writing applications. Creative writing requires a balance between subjective writing quality (e.g., literariness and emotional expression) and objective constraint following (e.g., format requirements and word limits). Existing methods find it difficult to balance these two aspects: single reward strategies fail to improve both abilities simultaneously, while fixed-weight mixed-reward methods lack the ability to adapt to different writing scenarios. To address this problem, we propose Reinforcement Learning with Mixed Rewards (RLMR), utilizing a dynamically mixed reward system from a writing reward model evaluating subjective writing quality and a constraint verification model assessing objective constraint following. The constraint following reward weight is adjusted dynamically according to the writing quality within sampled…
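The abstract is cut off before it states the exact weighting rule, so the following is only a minimal sketch of what a dynamically weighted mixed reward could look like, not the paper's formula. It assumes each sampled group comes with one writing-quality score per response and one pass/fail flag from a constraint verifier. The function name `mixed_rewards`, the `margin` parameter, and the choice of "quality spread plus margin" as the penalty weight are all hypothetical, picked so that the penalty scales with the quality scores seen in the group.

```python
from typing import List


def mixed_rewards(
    quality: List[float],
    satisfied: List[bool],
    margin: float = 0.1,
) -> List[float]:
    """Combine quality scores and constraint checks for one sampled group.

    Hypothetical rule (an assumption, not taken from the paper): the
    constraint penalty weight is the spread of quality scores in the
    group plus a small margin, so it adapts to each group.
    """
    if len(quality) != len(satisfied):
        raise ValueError("quality and satisfied must have the same length")
    if not quality:
        return []

    # The weight depends on the quality scores within this group.
    spread = max(quality) - min(quality)
    weight = spread + margin

    # Responses that pass verification keep their quality score;
    # responses that violate a constraint are penalized by the weight.
    return [q if ok else q - weight for q, ok in zip(quality, satisfied)]


if __name__ == "__main__":
    scores = mixed_rewards([0.9, 0.5, 0.7], [False, True, True])
    print(scores)
```

Under this assumed rule, a response that violates a constraint always ends up with a lower mixed reward than any response in the same group that satisfies it, however high its writing-quality score. A fixed penalty weight would not give that guarantee when quality scores vary widely between groups.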