Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning