Lever: Inference-Time Policy Reuse under Support Constraints
Ihor Vitenko, Noha Ibrahim, Sihem Amer-Yahia

TL;DR
This paper introduces Lever, a framework for inference-time policy reuse in reinforcement learning, which composes new policies offline from a library of pre-trained policies using behavioral embeddings and Q-value composition.
Contribution
Lever is the first end-to-end method for offline policy composition that balances performance and computational cost in support-limited regimes.
Findings
Inference-time composition can match or exceed training-from-scratch performance.
Performance depends critically on the coverage of available transitions.
Long-horizon dependencies pose fundamental limitations for offline reuse.
Abstract
Reinforcement learning (RL) policies are typically trained for fixed objectives, making reuse difficult when task requirements change. We study inference-time policy reuse: given a library of pre-trained policies and a new composite objective, can a high-quality policy be constructed entirely offline, without additional environment interaction? We introduce Lever (Leveraging Efficient Vector Embeddings for Reusable policies), an end-to-end framework that retrieves relevant policies, evaluates them using behavioral embeddings, and composes new policies via offline Q-value composition. We focus on the support-limited regime, where no value propagation is possible, and show that the effectiveness of reuse depends critically on the coverage of available transitions. To balance performance and computational cost, Lever proposes composition strategies that control the exploration of candidate…
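The retrieve-then-compose pipeline described in the abstract can be illustrated with a small tabular sketch. This is not the paper's implementation: the cosine-similarity retrieval, the weighted-sum composition rule, the boolean support mask, and all function names (`retrieve`, `compose_q`, `greedy_policy`) are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def retrieve(library_embeddings, query_embedding, k):
    """Return indices of the k policies whose embeddings are most
    cosine-similar to the query (assumed retrieval rule)."""
    lib = library_embeddings / np.linalg.norm(library_embeddings, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = lib @ q
    return np.argsort(-scores)[:k]

def compose_q(q_tables, weights):
    """Weighted sum of Q-tables (assumed composition rule).
    q_tables: (n_policies, n_states, n_actions); weights: (n_policies,)."""
    return np.tensordot(weights, q_tables, axes=1)

def greedy_policy(q_composed, support_mask):
    """Greedy action per state, restricted to state-action pairs that
    appear in the offline data (assumed form of the support constraint).
    Each state is assumed to have at least one supported action."""
    masked = np.where(support_mask, q_composed, -np.inf)
    return masked.argmax(axis=1)

# Hypothetical toy library: 3 policies, 2 states, 2 actions.
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
q_tables = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.0, 2.0], [3.0, 0.0]],
])
support = np.array([[True, False], [True, True]])

chosen = retrieve(embeddings, np.array([1.0, 0.0]), k=2)
q_new = compose_q(q_tables[chosen], np.array([0.5, 0.5]))
actions = greedy_policy(q_new, support)
```

In this toy case the unconstrained greedy action in state 0 would be action 1, but that pair is not covered by the data, so the masked policy falls back to action 0. This mirrors, in a very simplified way, the finding that reuse quality depends on the coverage of available transitions.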
