Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization
Aadyot Bhatnagar, Peter Mørch Groth, Ali Madani

TL;DR
This paper introduces STOMP, a novel offline reinforcement learning algorithm that uses smooth Tchebysheff scalarization to effectively optimize multiple conflicting objectives, demonstrated on protein engineering tasks.
Contribution
The paper develops STOMP, a new multi-objective offline RL method that overcomes linear scalarization limitations using smooth Tchebysheff scalarization, with empirical validation on protein datasets.
Findings
STOMP achieves the highest hypervolumes in 8 of 9 settings.
It outperforms state-of-the-art baselines in multi-objective protein optimization.
STOMP is robust and improves post-trained models for multi-attribute tasks.
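For readers unfamiliar with the hypervolume metric cited above, the following is a minimal sketch of how it can be computed for two objectives that are both maximized. The function name `hypervolume_2d`, the reference point, and the example points are illustrative assumptions; the summary does not state which reference points or implementation the paper uses.

```python
def hypervolume_2d(points, reference):
    """Area dominated by `points` and bounded below by `reference`.

    Assumes both objectives are maximized. `points` is an iterable of
    (x, y) pairs; `reference` is a pair that every useful point should
    beat in both coordinates. Hypothetical helper, not from the paper.
    """
    rx, ry = reference
    # Ignore points that do not strictly improve on the reference point.
    kept = [p for p in points if p[0] > rx and p[1] > ry]
    # Sweep from the largest x to the smallest x.
    kept.sort(key=lambda p: p[0], reverse=True)

    area = 0.0
    best_y = ry  # highest y seen so far in the sweep
    for x, y in kept:
        if y > best_y:
            # Only the strip above the previous best y is new area.
            area += (x - rx) * (y - best_y)
            best_y = y
    return area


# Toy front with three trade-off points plus one dominated point.
front = [(1, 3), (2, 2), (3, 1)]
hv_front = hypervolume_2d(front, reference=(0, 0))
hv_with_dominated = hypervolume_2d(front + [(1, 1)], reference=(0, 0))
```

A larger hypervolume means the set of solutions covers more of the objective space, and dominated points such as `(1, 1)` contribute nothing to the total.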
Abstract
Large language models can be aligned with human preferences through offline reinforcement learning (RL) on small labeled datasets. While single-objective alignment is well-studied, many real-world applications demand the simultaneous optimization of multiple conflicting rewards, e.g. optimizing both catalytic activity and specificity in protein engineering, or helpfulness and harmlessness for chatbots. Prior work has largely relied on linear reward scalarization, but this approach provably fails to recover non-convex regions of the Pareto front. In this paper, instead of scalarizing the rewards directly, we frame multi-objective RL itself as an optimization problem to be scalarized via smooth Tchebysheff scalarization, a recent technique that overcomes the shortcomings of linear scalarization. We use this formulation to derive Smooth Tchebysheff Optimization of Multi-Objective…
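The contrast between linear and smooth Tchebysheff scalarization can be illustrated with a toy example. The sketch below uses the generic form of smooth Tchebysheff scalarization from the multi-objective optimization literature, written for losses to be minimized: `mu * log(sum_i exp(w_i * (loss_i - ideal_i) / mu))`. It is not the STOMP objective, which the truncated abstract does not spell out. The candidate values, weights, ideal point, and smoothing parameter `mu` are all assumptions chosen for illustration.

```python
import math


def linear_scalarization(losses, weights):
    """Weighted sum of the losses."""
    return sum(w * l for w, l in zip(weights, losses))


def smooth_tchebysheff(losses, weights, ideal, mu):
    """Smooth approximation of max_i w_i * (loss_i - ideal_i).

    Computed as mu * logsumexp(w_i * (loss_i - ideal_i) / mu) with the
    usual max-shift for numerical stability. Smaller mu tracks the hard
    maximum more closely.
    """
    terms = [w * (l - z) / mu for w, l, z in zip(weights, losses, ideal)]
    shift = max(terms)
    return mu * (shift + math.log(sum(math.exp(t - shift) for t in terms)))


# Three mutually non-dominated candidates (two losses each, lower is better).
# C lies in a non-convex region: it is worse than the midpoint of A and B.
candidates = {"A": (0.0, 1.0), "B": (1.0, 0.0), "C": (0.6, 0.6)}
ideal = (0.0, 0.0)  # assumed ideal point


def best(score):
    """Name of the candidate with the lowest scalarized value."""
    return min(candidates, key=lambda name: score(candidates[name]))


# Sweep the linear weight over [0, 1]; C should never be selected.
linear_winners = {
    best(lambda l, w=w: linear_scalarization(l, (w, 1.0 - w)))
    for w in [i / 10 for i in range(11)]
}

# Equal weights under smooth Tchebysheff scalarization select C.
stch_winner = best(lambda l: smooth_tchebysheff(l, (0.5, 0.5), ideal, mu=0.05))

print("linear winners:", sorted(linear_winners))
print("smooth Tchebysheff winner:", stch_winner)
```

In this toy setting no choice of linear weights picks the balanced candidate C, whereas the smooth Tchebysheff score with equal weights does, which is the kind of non-convex Pareto region the abstract refers to.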
