RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks

Mian Wu; Gavin Zhang; Sewon Min; Sergey Levine; Aviral Kumar

arXiv:2511.01758·cs.LG·November 4, 2025

Open Access · 3 Reviews

TL;DR

RLAC introduces a dynamic adversarial critic framework using large language models to improve free-form generation tasks by efficiently identifying failure modes and enhancing output quality, reducing verification costs.

Contribution

This paper presents RLAC, a novel reinforcement learning approach with a dynamic adversarial critic that improves free-form generation by focusing on likely failure modes, enabling scalable and effective post-training optimization.

Findings

01. RLAC improves factual accuracy in text generation.

02. RLAC enhances correctness in code generation.

03. Dynamic critics outperform fixed critics in verification efficiency.

Abstract

Open-ended generation tasks require outputs to satisfy diverse and often implicit task-specific evaluation rubrics. The sheer number of relevant rubrics leads to prohibitively high verification costs and incomplete assessments of a response, making reinforcement learning (RL) post-training with rubric-based rewards difficult to scale. This problem is exacerbated by the fact that often the best way to combine these rubrics into one single reward is also highly prompt-specific. We propose Reinforcement Learning with Adversarial Critic (RLAC), a post-training approach that addresses these challenges via dynamic rubric verification. Our approach employs a large language model (LLM) as a critic that dynamically identifies only the most likely failure modes (e.g., a factual error or unhandled edge case), which are then verified by an external validator to optimize both generator and critic…
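The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the names `rlac_round`, `Check`, and the `toy_*` functions are hypothetical, the validator is a lookup in a tiny hand-written fact set, and the way preference pairs and critic rewards are formed is an assumption made for the example. The excerpt does not specify the actual update rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    response: str  # candidate output from the generator
    rubric: str    # the single failure mode the critic chose to test
    passed: bool   # external validator's verdict on that rubric


def rlac_round(
    prompt: str,
    generate: Callable[[str], List[str]],
    criticize: Callable[[str, str], str],
    validate: Callable[[str, str, str], bool],
) -> Tuple[List[Tuple[str, str]], List[float]]:
    """One round: the critic names one likely failure per response,
    and the validator checks only that rubric (not every rubric)."""
    checks = []
    for response in generate(prompt):
        rubric = criticize(prompt, response)
        checks.append(Check(response, rubric, validate(prompt, response, rubric)))

    passed = [c for c in checks if c.passed]
    failed = [c for c in checks if not c.passed]
    # Assumed generator signal: (preferred, rejected) response pairs.
    generator_pairs = [(p.response, f.response) for p in passed for f in failed]
    # Assumed critic signal: reward 1.0 when the flagged failure is confirmed.
    critic_rewards = [0.0 if c.passed else 1.0 for c in checks]
    return generator_pairs, critic_rewards


# Toy stand-ins (hypothetical) so the sketch runs end to end.
KNOWN_FACTS = {"Water freezes at 0 C.", "Water boils at 100 C at sea level."}


def toy_generate(prompt: str) -> List[str]:
    return [
        "Water freezes at 0 C. Water boils at 100 C at sea level.",
        "Water freezes at 0 C. Water boils at 50 C at sea level.",
    ]


def toy_criticize(prompt: str, response: str) -> str:
    # Flags the last sentence as the claim most likely to be wrong.
    sentences = [s.strip() + "." for s in response.split(".") if s.strip()]
    return sentences[-1]


def toy_validate(prompt: str, response: str, rubric: str) -> bool:
    # The rubric here is a single factual claim; it passes if it is known.
    return rubric in KNOWN_FACTS


pairs, rewards = rlac_round(
    "Tell me about water.", toy_generate, toy_criticize, toy_validate
)
```

In this sketch each response costs one validator call, regardless of how many claims it contains. That is the verification saving the abstract points to, at the price of checking only the rubric the critic selects.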

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Important and natural problems to tackle for rubric-based reward modeling and RL training for LLM post-training.
- Theoretical formulation is solid.
- Strong factual text generation results.
- Good ablation studies.
- The 4- to 8-sentence comparison shows the method scales well with complexity.
- Presentation is clear.

Weaknesses

- I have doubts about the code experiment setup, as mentioned above.
- Lack of theoretical or empirical analysis of the method and experimental results.
- Could benefit from more analysis and learnings, and more experiments.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The paper is well-written and easy to follow.
- The paper outperforms baseline methods in both factual text generation and code generation.
- The proposed approach is novel, applying adversarial learning to the important problem of identifying failure cases in free-form generation.
- I find the use of DPO fairly interesting and novel.

Weaknesses

- For fact verification, I do not necessarily see the need for a critic to specify which fact to check. Can one not separate out all the facts in the generated responses, either programmatically or with an LLM, and then run each one through the fact verifier? I understand from later in the paper that verification is costly, but verifying each fact would provide more accurate rewards for the generator, correct? For code generation, I understand this simplification is not possible because the te

Reviewer 03 · Rating 4 · Confidence 3

Strengths

Post-training of large language models is crucial for improving task-specific use of generative models and for providing robust LLMs. The presentation of the paper is clear, the problem is well motivated, and the overall description of the method is good. Experimental results demonstrate that RLDCF yields competitive results in both text and code generation quality, highlighting the effectiveness of adversarial critic feedback for fine-tuning.
- In the text generation experiment, RLDCF achie

Weaknesses

The results are interesting and promising on the two proposed tasks. However, as the paper is mostly experimental, I would expect more discussion of the methodological choices, for instance the way the critic and generator are updated. The influence of K (candidate outputs for each instruction) or N (number of criteria sampled from the critic) on the results should be strong. Even if no theoretical guarantees are provided (which is probably a very hard question), I would expec

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques