RLHF in an SFT Way: From Optimal Solution to Reward-Weighted Alignment