Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Yang Li, Linyang Li, Haodong Duan, Qingwen Liu, and Kai Chen

TL;DR
This paper introduces OPT-BENCH, a comprehensive framework for training and evaluating large language models on NP-hard optimization problems using quality-aware reinforcement learning, emphasizing solution quality and generalization.
Contribution
It presents a new benchmark and training infrastructure for NP-hard problems, incorporating quality-aware rewards and diverse tasks to improve LLM optimization capabilities.
Findings
Training with quality-aware rewards improves solution quality by 28.8%.
Models trained with OPT-BENCH outperform GPT-4o in Success Rate and Quality Ratio.
Task diversity enhances generalization more than data quantity.
Abstract
Large Language Models (LLMs) have achieved remarkable success on reasoning benchmarks through Reinforcement Learning with Verifiable Rewards (RLVR), excelling at tasks such as math, coding, logic, and puzzles. However, existing benchmarks evaluate only correctness, while overlooking optimality, namely the ability to find the best solutions under constraints. We propose OPT-BENCH, the first comprehensive framework for training and evaluating LLMs on NP-hard optimization problems through quality-aware RLVR. OPT-BENCH provides three key components: a scalable training infrastructure with instance generators, quality verifiers, and optimal baselines across 10 tasks; a rigorous benchmark with 1,000 instances evaluating both feasibility, measured by Success Rate, and quality, measured by Quality Ratio; and quality-aware rewards that enable continuous improvement beyond binary correctness.…
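To make the two metrics and the reward idea concrete, here is a minimal sketch in Python. The abstract names Success Rate and Quality Ratio but does not give their formulas, so everything below is an assumption made for illustration. Success Rate is taken as the fraction of feasible solutions. Quality Ratio is taken as the optimal objective value divided by the achieved value for minimization tasks (inverted for maximization), clipped to [0, 1] and averaged over all instances. The names `Result`, `quality_ratio`, `success_rate`, `binary_reward`, and `quality_aware_reward` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Result:
    """One model answer on one instance (hypothetical structure)."""
    feasible: bool            # did the answer satisfy all constraints?
    value: Optional[float]    # objective value of the answer, None if infeasible
    optimal: float            # objective value of the reference (optimal) baseline


def quality_ratio(r: Result, minimize: bool = True) -> float:
    """Assumed definition: 0 for infeasible answers, else optimal/value
    (minimization) or value/optimal (maximization), clipped to [0, 1]."""
    if not r.feasible or r.value is None:
        return 0.0
    if r.value <= 0 or r.optimal <= 0:
        raise ValueError("this sketch assumes strictly positive objective values")
    ratio = r.optimal / r.value if minimize else r.value / r.optimal
    return max(0.0, min(1.0, ratio))


def success_rate(results: Sequence[Result]) -> float:
    """Fraction of instances with a feasible answer."""
    return sum(r.feasible for r in results) / len(results)


def mean_quality_ratio(results: Sequence[Result], minimize: bool = True) -> float:
    """Average quality ratio over all instances (infeasible ones count as 0)."""
    return sum(quality_ratio(r, minimize) for r in results) / len(results)


def binary_reward(r: Result) -> float:
    """Correctness-only reward: 1 for any feasible answer, 0 otherwise."""
    return 1.0 if r.feasible else 0.0


def quality_aware_reward(r: Result, minimize: bool = True) -> float:
    """Assumed quality-aware reward: feasible answers earn more the closer
    they are to the optimal baseline."""
    return quality_ratio(r, minimize)


# Three hypothetical minimization instances, each with optimal cost 10.
results = [
    Result(feasible=True, value=12.0, optimal=10.0),   # feasible but suboptimal
    Result(feasible=True, value=10.0, optimal=10.0),   # feasible and optimal
    Result(feasible=False, value=None, optimal=10.0),  # violates a constraint
]
sr = success_rate(results)
qr = mean_quality_ratio(results)
```

Under these assumptions, a binary reward scores the first two answers identically, while the quality-aware reward separates them. This is the sense in which a quality signal can keep improving a model after it has learned to produce feasible answers.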
