DISA: Offline Importance Sampling for Distribution-Matching LLM-RL
Shaobo Wang, Yujie Chen, Yafeng Sun, Wenjie Qiu, Zhihui Xie, Sihang Li, Yucheng Li, Huiqiang Jiang, Xingzhang Ren, Xuming Hu, Dayiheng Liu, Linfeng Zhang

TL;DR
DISA introduces an offline importance sampling method for distribution-matching RL in LLMs, improving diversity and performance by decoupling partition function estimation from policy learning.
Contribution
The paper proposes DISA, which estimates the prompt-dependent partition function offline via importance sampling, so that calibrating the partition function is decoupled from policy learning in distribution-matching LLM-RL.
Findings
DISA matches or exceeds FlowRL on multiple benchmarks.
DISA outperforms reward-maximization baselines on math tasks.
DISA retains more strategy diversity than reward-maximization methods.
Abstract
Modern reasoning agents are increasingly evaluated on their ability to generate multiple valid solution paths, plans, or tool-use traces for a given input. Standard reward-maximizing RL tends to collapse onto the most easily reinforced high-reward mode, whereas distribution-matching RL aims to allocate probability mass across the entire reward-shaped solution set. Achieving this objective requires computing a prompt-dependent partition function over the trajectory space. Because existing distribution-matching methods learn this partition function online alongside the policy, calibration errors in the partition function directly distort policy updates and remain impossible to diagnose independently. We introduce DISA, short for Decoupled Importance-Sampled Anchoring, which moves this calibration problem outside the RL loop. DISA draws proposal trajectories offline, estimates the…
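To make the offline estimation step concrete, here is a minimal sketch of a generic importance-sampling estimate of a prompt-dependent log partition function. It is not taken from the paper: the target form p*(y|x) proportional to pi_ref(y|x) * exp(beta * r(x, y)), the function name log_partition_is, and the parameters rewards, logp_ref, logp_prop, and beta are all assumptions made for illustration, and DISA's actual estimator may differ.

```python
import math

def log_partition_is(rewards, logp_ref, logp_prop, beta=1.0):
    """Importance-sampling estimate of log Z(x) for one prompt x.

    Hypothetical setup (not from the paper):
      Z(x) = sum_y pi_ref(y|x) * exp(beta * r(x, y))
    estimated from N trajectories y_i drawn offline from a proposal q(.|x):
      Z_hat = (1/N) * sum_i exp(beta * r_i) * pi_ref(y_i|x) / q(y_i|x)

    rewards   : r(x, y_i) for each sampled trajectory
    logp_ref  : log pi_ref(y_i|x)
    logp_prop : log q(y_i|x)
    """
    n = len(rewards)
    if n == 0 or len(logp_ref) != n or len(logp_prop) != n:
        raise ValueError("inputs must be non-empty and of equal length")

    # Log importance weight for each sample.
    log_w = [beta * r + lr - lq
             for r, lr, lq in zip(rewards, logp_ref, logp_prop)]

    # Log-mean-exp, shifted by the max for numerical stability.
    m = max(log_w)
    return m + math.log(sum(math.exp(lw - m) for lw in log_w)) - math.log(n)
```

When the proposal equals the reference policy, the weights reduce to exp(beta * r_i), so the estimate is the log of the average exponentiated reward. Because the estimate is computed once from fixed samples, its quality can be checked (for example, by varying N) before any policy update uses it.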
