Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation

Ziru Chen; Dongdong Chen; Ruinan Jin; Yingbin Liang; Yujia Xie; Huan Sun

arXiv:2602.03806·cs.LG·February 4, 2026

TL;DR

This paper introduces Cobalt, a novel offline-online reinforcement learning method that improves multi-turn code generation by combining trajectory collection with contextual bandit training, outperforming existing baselines.

Contribution

Cobalt is a new method that leverages offline trajectories and online bandit learning to enhance multi-turn code generation in large language models.

Findings

01 Cobalt outperforms two multi-turn online RL baselines.

02 Cobalt significantly improves Pass@1 scores on LiveCodeBench.

03 The paper analyzes in-context reward hacking behaviors.

Abstract

Recently, there has been significant research interest in training large language models (LLMs) with reinforcement learning (RL) on real-world tasks, such as multi-turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinder wide adoption. In this paper, we build on the observation that multi-turn code generation can be formulated as a one-step recoverable Markov decision process and propose contextual bandit learning with offline trajectories (Cobalt), a new method that combines the benefits of online and offline RL. Cobalt first collects code generation trajectories using a reference LLM and divides them into partial trajectories that serve as contextual prompts. Then, during online bandit learning, the LLM is trained to complete each partial trajectory prompt through single-step code generation. Cobalt outperforms two…
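The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Turn`, `Trajectory`, `partial_contexts`, `render_prompt`, and `bandit_samples` are hypothetical, the prompt layout is invented for illustration, and the policy and reward function are toy stand-ins for an LLM and a test-based reward. It assumes each prefix of an offline trajectory (from zero turns up to all but the last) is used as one bandit context.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Turn:
    code: str      # program produced at this turn
    feedback: str  # execution feedback observed after running it


@dataclass(frozen=True)
class Trajectory:
    problem: str
    turns: Tuple[Turn, ...]


def partial_contexts(traj: Trajectory) -> List[Trajectory]:
    """Split one offline trajectory into its prefixes (assumed: 0..T-1 turns)."""
    return [Trajectory(traj.problem, traj.turns[:k]) for k in range(len(traj.turns))]


def render_prompt(ctx: Trajectory) -> str:
    """Turn a partial trajectory into a contextual prompt (hypothetical layout)."""
    parts = [f"Problem:\n{ctx.problem}"]
    for i, turn in enumerate(ctx.turns, start=1):
        parts.append(f"Attempt {i}:\n{turn.code}\nFeedback {i}:\n{turn.feedback}")
    parts.append("Next attempt:")
    return "\n\n".join(parts)


def bandit_samples(
    contexts: List[Trajectory],
    policy: Callable[[str], str],
    reward_fn: Callable[[str, str], float],
) -> List[Tuple[str, str, float]]:
    """One single-step generation per context: (prompt, action, reward) triples."""
    samples = []
    for ctx in contexts:
        prompt = render_prompt(ctx)
        action = policy(prompt)  # single-step code generation
        samples.append((prompt, action, reward_fn(ctx.problem, action)))
    return samples


# Toy usage: a two-turn trajectory from a "reference" model.
traj = Trajectory(
    problem="Write add(a, b) that returns the sum.",
    turns=(
        Turn("def add(a, b):\n    return a - b", "FAILED: add(1, 2) returned -1"),
        Turn("def add(a, b):\n    return a + b", "PASSED"),
    ),
)
contexts = partial_contexts(traj)


def toy_policy(prompt: str) -> str:
    # Stand-in for the LLM being trained; ignores the prompt.
    return "def add(a, b):\n    return a + b"


def toy_reward(problem: str, code: str) -> float:
    # Stand-in for a test-based reward: run the code and check one case.
    scope: dict = {}
    exec(code, scope)
    return 1.0 if scope["add"](1, 2) == 3 else 0.0


samples = bandit_samples(contexts, toy_policy, toy_reward)
```

In this sketch the resulting `(prompt, action, reward)` triples would feed a policy-gradient update of the trained model. The update rule itself is omitted because the truncated abstract does not specify it.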

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics