CoAct: Co-Active LLM Preference Learning with Human-AI Synergy

Ruiyao Xu, Mihir Parmar, Tiankai Yang, Zhengyu Hu, Yue Zhao, and Kaize Ding

arXiv:2604.17501 · cs.CL · April 21, 2026

TL;DR

CoAct is a framework that combines self-rewarding and active learning through human-AI collaboration to improve LLM alignment, with reported gains of +13.25% on GSM8K, +8.19% on MATH, and +13.16% on WebInstruct.

Contribution

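CoAct combines self-rewarding and active learning for preference learning. It uses self-consistency to separate reliable self-labeled data from samples that require oracle verification, and it uses oracle feedback to guide the model toward generating new instructions within its solvable capability.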

Findings

01. Achieves +13.25% on GSM8K

02. Achieves +8.19% on MATH

03. Achieves +13.16% on WebInstruct

Abstract

Learning from preference-based feedback has become an effective approach for aligning LLMs across diverse tasks. However, high-quality human-annotated preference data remains expensive and scarce. Existing methods address this challenge through either self-rewarding, which scales by using purely AI-generated labels but risks unreliability, or active learning, which ensures quality through oracle annotation but cannot fully leverage unlabeled data. In this paper, we present CoAct, a novel framework that synergistically combines self-rewarding and active learning through strategic human-AI collaboration. CoAct leverages self-consistency to identify both reliable self-labeled data and samples that require oracle verification. Additionally, oracle feedback guides the model to generate new instructions within its solvable capability. Evaluated on three reasoning benchmarks across two model…
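
To make the routing idea in the abstract concrete, here is a minimal sketch of how self-consistency could split samples into "trust the self-label" and "send to the oracle". It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact criterion, so the majority-vote agreement score, the `threshold` value of 0.7, and the names `Routed` and `route_by_self_consistency` are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Routed:
    prompt: str
    majority_answer: str
    agreement: float  # fraction of sampled answers that match the majority
    route: str        # "self_label" or "oracle"


def route_by_self_consistency(prompt, sampled_answers, threshold=0.7):
    """Route one prompt by how strongly its sampled answers agree.

    Assumption: agreement is measured by simple majority vote over final
    answers, and `threshold` is an illustrative value, not one from the paper.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    majority_answer, votes = counts.most_common(1)[0]
    agreement = votes / len(sampled_answers)
    # High agreement: keep the model's own label.
    # Low agreement: ask the oracle (human annotator) to verify.
    route = "self_label" if agreement >= threshold else "oracle"
    return Routed(prompt, majority_answer, agreement, route)


# Hypothetical sampled answers for two prompts.
confident = route_by_self_consistency("2 + 2 = ?", ["4", "4", "4", "4", "5"])
uncertain = route_by_self_consistency(
    "17 * 23 = ?", ["391", "381", "401", "391", "371"]
)
print(confident.route, uncertain.route)
```

In this sketch the first prompt has four of five samples in agreement and keeps its self-label, while the second has no answer above the threshold and is routed to the oracle. A real system would also need answer normalization before voting, which is omitted here.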
