Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections