Optimistic Task Inference for Behavior Foundation Models

Thomas Rupf; Marco Bagatella; Marin Vlastelica; Andreas Krause

arXiv:2510.20264·cs.LG·March 3, 2026

Open Access 3 Reviews

TL;DR

This paper introduces OpTI-BFM, a method that improves task inference in Behavior Foundation Models by actively interacting with the environment, leading to efficient zero-shot reward identification with minimal compute.

Contribution

It proposes an optimistic decision criterion for BFMs that models reward uncertainty and guides data collection, with theoretical regret bounds and empirical validation.

Findings

- Enables BFMs to identify unseen rewards within a few episodes
- Provides regret bounds linked to upper-confidence algorithms
- Achieves minimal compute overhead in zero-shot benchmarks

Abstract

Behavior Foundation Models (BFMs) are capable of retrieving a high-performing policy for any reward function specified directly at test-time, commonly referred to as zero-shot reinforcement learning (RL). While this is a very efficient process in terms of compute, it can be less so in terms of data: as a standard assumption, BFMs require computing rewards over a non-negligible inference dataset, assuming either access to a functional form of rewards, or significant labeling efforts. To alleviate these limitations, we tackle the problem of task inference purely through interaction with the environment at test-time. We propose OpTI-BFM, an optimistic decision criterion that directly models uncertainty over reward functions and guides BFMs in data collection for task inference. Formally, we provide a regret bound for well-trained BFMs through a direct connection to upper-confidence…

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- Tackles a significant and practical bottleneck. Online task inference with only a few active-interaction episodes is important for real-world applications.
- Solid theoretical guarantees via connections to linear bandit algorithms.
- Good empirical performance with additional experiments and analysis, e.g., episode-level updates and non-stationary rewards.

Weaknesses

- The paper operates under assumptions of a perfect BFM / successor feature model. It is unclear what would happen for the theoretical guarantees or how the empirical results would change with approximation errors.
- The formal regret bound is proven for an episodic-update variant of OpTI-BFM. However, the experiments (Sec. 5.3, Fig. 4) show that the step-update variant is empirically superior and converges much faster. While it is a positive result that the practical algorithm is even better,

Reviewer 02 · Rating 8 · Confidence 2

Strengths

The authors introduce a new framework for task inference in BFMs without labeled offline (state, reward) data. In this framework, the relationship between BFM policy search and linear bandits is exploited to develop, and prove a regret bound for, the OpTI-BFM algorithm for online task inference. OpTI-BFM is timely and tackles the problematic requirement for labeled data, with implications for many real-world applications. The empirical results support the efficacy of OpTI-BFM in three standard z

Weaknesses

- Whilst the experiment section is already quite strong, it could be improved further if the authors were able to show the performance of OpTI-BFM on an environment other than those of the DeepMind Control suite, e.g., an alternative task with pixel observations.
- There is not much discussion of how OpTI-BFM could be deployed for real-world use. The authors say that their method would enable BFMs to work “beyond domains in which rewards are readily available”, but it is not obvious to me how it

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- One of the core original ideas is the new task‑space bandit formulation for BFMs. The paper formulates online task inference as linear bandit optimization in the task-embedding space: with well-trained USFs, the expected episode return is approximately linear in the successor features of the policy conditioned on a task embedding, i.e., $\mathbb{E}[\hat{G}_k \mid s_0, \pi_z] \approx \langle \psi(s_0, z), z_r \rangle$. This lets the agent choose $z$ using a rule over confidence sets on the unkn
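The bandit view this reviewer describes can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy stand-in `psi` for the successor features $\psi(s_0, z)$, a fixed exploration weight `beta`, episode-level updates, and random shooting over unit-norm candidate embeddings. The names `psi`, `select_z`, `run`, and all default parameter values are hypothetical. The learner keeps a ridge-regression estimate of $z_r$ and picks the candidate $z$ with the highest upper confidence bound on $\langle \psi(s_0, z), z_r \rangle$.

```python
import numpy as np


def psi(z):
    """Hypothetical stand-in for successor features psi(s0, z); not a trained BFM."""
    return z / np.linalg.norm(z)


def select_z(V, b, beta, n_candidates, rng):
    """Pick a task embedding by random-shooting maximization of a UCB score."""
    d = V.shape[0]
    V_inv = np.linalg.inv(V)
    z_hat = V_inv @ b  # ridge-regression estimate of the unknown z_r
    cands = rng.standard_normal((n_candidates, d))
    cands /= np.linalg.norm(cands, axis=1, keepdims=True)
    feats = np.apply_along_axis(psi, 1, cands)
    mean = feats @ z_hat  # estimated return per candidate
    # Exploration bonus: sqrt(psi^T V^{-1} psi) for every candidate.
    bonus = np.sqrt(np.einsum("nd,de,ne->n", feats, V_inv, feats))
    return cands[np.argmax(mean + beta * bonus)]


def run(d=4, episodes=200, noise=0.1, beta=1.0, lam=1.0, seed=0):
    """Simulate episode-level task inference against a hidden linear reward."""
    rng = np.random.default_rng(seed)
    z_r = rng.standard_normal(d)
    z_r /= np.linalg.norm(z_r)  # hidden task embedding (toy ground truth)
    V = lam * np.eye(d)  # regularized design matrix
    b = np.zeros(d)  # return-weighted feature sum
    for _ in range(episodes):
        z = select_z(V, b, beta, 256, rng)
        feat = psi(z)
        # Toy episode return: linear in the features plus Gaussian noise.
        ret = feat @ z_r + noise * rng.standard_normal()
        V += np.outer(feat, feat)
        b += ret * feat
    return z_r, np.linalg.solve(V, b)


if __name__ == "__main__":
    z_r, z_hat = run()
    cosine = z_hat @ z_r / np.linalg.norm(z_hat)
    print(f"cosine(z_hat, z_r) = {cosine:.3f}")
```

In this toy setting the estimate aligns with the hidden embedding as returns accumulate. A fixed `beta` is a simplification; regret analyses of upper-confidence linear bandits typically use a confidence radius that grows with the number of observations.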

Weaknesses

- The theory assumes perfect USFs and that the policy conditioned on $z$ is (near) optimal for reward $z$ (A1), strictly linear rewards with sub-Gaussian noise (A2), and an optimization oracle for Eq. (10) (A3); these assumptions are introduced before Algorithm 1 on p. 5. In practice, USFs are learned with function approximation and the acquisition is solved by random shooting. The paper notes “we found OpTI-BFM to perform well even when [A1–A2] are violated” but does not quantify robustness to misspecificati

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Digital Mental Health Interventions