BinaryPPO: Efficient Policy Optimization for Binary Classification
Punya Syon Pandey, Zhijing Jin

TL;DR
BinaryPPO introduces a reinforcement learning framework that reformulates binary classification as reward maximization, significantly improving accuracy over traditional supervised fine-tuning, especially in noisy or imbalanced data scenarios.
Contribution
The paper presents BinaryPPO, a novel offline RL method using reward shaping for robust binary classification, outperforming supervised methods across multiple benchmarks.
Findings
BinaryPPO reaches up to 99% accuracy across eight domain-specific benchmarks and multiple model architectures.
It improves accuracy by 40-60 percentage points over supervised fine-tuning baselines.
Reward shaping and policy stability are key to success.
Abstract
Supervised fine-tuning (SFT) is the standard approach for binary classification tasks such as toxicity detection, factuality verification, and causal inference. However, SFT often performs poorly in real-world settings with label noise, class imbalance, or sparse supervision. We introduce BinaryPPO, an offline reinforcement learning framework for large language models (LLMs) that reformulates binary classification as a reward maximization problem. Our method leverages a variant of Proximal Policy Optimization (PPO) with a confidence-weighted reward function that penalizes uncertain or incorrect predictions, enabling the model to learn robust decision policies from static datasets without online interaction. Across eight domain-specific benchmarks and multiple models with differing architectures, BinaryPPO improves accuracy by 40-60 percentage points, reaching up to 99%, substantially…
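The abstract does not give the reward formula or the exact PPO variant, so the following is only a minimal sketch of how a confidence-weighted reward and a clipped PPO objective could fit together. The function names, the entropy-based uncertainty term, and the `uncertainty_penalty` and `clip_eps` defaults are all assumptions made for illustration, not details taken from the paper. The clipped surrogate itself is the standard PPO form.

```python
import math


def confidence_weighted_reward(p_positive, label, uncertainty_penalty=0.5):
    """Illustrative reward (an assumption, not the paper's formula).

    p_positive: model probability for class 1, in [0, 1].
    label: ground-truth class, 0 or 1.
    """
    # Probability the model assigned to the correct class.
    p_true = p_positive if label == 1 else 1.0 - p_positive
    # Signed confidence: +1 for a confident correct answer,
    # -1 for a confident wrong answer, 0 at p_true = 0.5.
    signed_confidence = 2.0 * p_true - 1.0
    # Binary entropy in bits: 1 at p = 0.5, 0 at p in {0, 1}.
    entropy = 0.0
    for q in (p_positive, 1.0 - p_positive):
        if q > 0.0:
            entropy -= q * math.log2(q)
    # Uncertain predictions lose reward even when they lean the right way.
    return signed_confidence - uncertainty_penalty * entropy


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one sample (to be maximized).

    In an offline setting, logp_old would come from a frozen reference
    policy scored on the static dataset (an assumption made here).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Under these assumptions, a confident correct prediction earns a reward near +1, a coin-flip prediction earns a negative reward because of the entropy term, and a confident wrong prediction earns a reward near -1. The clipping keeps each policy update close to the reference policy, which is one common way to obtain the policy stability the summary mentions.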
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning
