Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

Asim Osman; Sasha Abramowitz; Mark Bergh; Ulrich Armel Mbou Sob; Ruan John de Kock; Omayma Mahjoub; Oussama Hidaoui; Noah De Nicola; Arnol Manuel Fokam; Felix Chalumeau; Daniel Rajaonarivonivelomanantsoa; Siddarth Singh; Refiloe Shabe; Juan Claude Formanek; Simon Verster Du Toit; Arnu Pretorius

arXiv:2605.13554·cs.LG·May 14, 2026

TL;DR

This paper introduces CPPO, an on-policy contrastive reinforcement learning algorithm that learns goal-conditioned policies without hand-crafted reward functions and applies to both continuous and discrete environments.

Contribution

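The paper presents CPPO, the first on-policy contrastive RL method. It derives policy advantages directly from contrastive Q-values and optimises them with the standard PPO objective, connecting CRL, which has so far relied on off-policy optimisation, to on-policy training pipelines.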
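
As a rough illustration of what deriving advantages from Q-values and feeding them to PPO can look like, the sketch below handles a single state-goal pair with discrete actions. The baseline used here (the policy-weighted mean of the Q-values) is an assumption chosen for illustration, since this page does not state how CPPO computes its advantages. The function names and numeric values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_advantages(q_values, policy_probs):
    """Advantage per action: Q(s, a, g) minus the policy-weighted mean of Q.

    ASSUMPTION: this baseline is illustrative, not taken from the paper.
    """
    baseline = sum(p * q for p, q in zip(policy_probs, q_values))
    return [q - baseline for q in q_values]


def ppo_clip_objective(new_prob, old_prob, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one sampled action (to be maximised)."""
    ratio = new_prob / old_prob
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical contrastive Q-values for three discrete actions.
q_values = [1.0, 2.0, 0.5]
old_probs = softmax([0.0, 0.0, 0.0])  # uniform behaviour policy
advantages = contrastive_advantages(q_values, old_probs)
```

With this baseline the advantages average to zero under the policy, which is the usual property of an advantage estimate. No reward signal appears anywhere in the computation: only Q-values and action probabilities are needed.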

Findings

01 CPPO outperforms previous CRL methods in 14 out of 18 tasks.

02 CPPO matches or exceeds PPO's performance in 12 out of 18 tasks.

03 CPPO works effectively in both continuous and discrete, single-agent and multi-agent environments.

Abstract

Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in achieving viable self-supervised learning in RL, all existing CRL algorithms rely on off-policy optimisation and are mostly constrained to continuous action spaces, with little research invested in discrete environments. This leaves CRL disconnected from widely used and effective, modern on-policy training pipelines adopted across both single-agent and multi-agent RL in continuous and discrete environments. To establish a first connection, we introduce Contrastive Proximal Policy Optimisation (CPPO). CPPO is an on-policy contrastive RL algorithm that derives policy advantages directly from contrastive Q-values and optimises them via the standard PPO…
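
To make the contrastive objective over state-action and goal representations concrete, here is a minimal sketch of an InfoNCE-style critic loss. It assumes the common CRL formulation in which the Q-value is the inner product of a state-action embedding and a goal embedding, with each state-action pair matched to a goal reached later in its own trajectory. The exact loss used by CPPO is not given on this page, and all names below are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_q_matrix(sa_embeddings, goal_embeddings):
    """Q[i][j] = <phi(s_i, a_i), psi(g_j)> (ASSUMED inner-product critic)."""
    return [[dot(phi, psi) for psi in goal_embeddings] for phi in sa_embeddings]


def infonce_loss(sa_embeddings, goal_embeddings):
    """InfoNCE loss: goal i is the positive for pair i, other goals are negatives."""
    logits = contrastive_q_matrix(sa_embeddings, goal_embeddings)
    n = len(logits)
    return -sum(log_softmax(logits[i])[i] for i in range(n)) / n


# Hypothetical 2-D embeddings for a batch of two transitions.
phi = [[2.0, 0.0], [0.0, 2.0]]
psi_matched = [[2.0, 0.0], [0.0, 2.0]]   # each goal aligned with its own pair
psi_swapped = [[0.0, 2.0], [2.0, 0.0]]   # goals aligned with the wrong pair
loss_matched = infonce_loss(phi, psi_matched)
loss_swapped = infonce_loss(phi, psi_swapped)
```

The loss is small when each state-action embedding scores its own goal above the other goals in the batch, and large when the pairing is wrong. The supervision comes entirely from which goals were reached, so no hand-crafted reward function is involved.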
