Learning What Matters Now: Dynamic Preference Inference under Contextual Shifts

Xianwei Cao; Dou Quan; Zhenliang Zhang; Shuang Wang

arXiv:2603.22813·cs.AI·March 25, 2026

Open Access 3 Reviews

TL;DR

This paper introduces Dynamic Preference Inference (DPI), a framework enabling agents to adapt to changing preferences in decision-making tasks by inferring latent preferences and conditioning policies accordingly, outperforming static approaches.

Contribution

The paper proposes DPI, a novel probabilistic framework that infers and adapts to shifting preferences in sequential decision-making, integrating it with a preference-conditioned actor-critic.

Findings

01. DPI outperforms fixed-weight baselines in environments with changing objectives.

02. DPI effectively infers latent preferences during regime shifts.

03. DPI improves post-shift performance across various environments.

Abstract

Humans often juggle multiple, sometimes conflicting objectives and shift their priorities as circumstances change, rather than following a fixed objective function. In contrast, most computational decision-making and multi-objective RL methods assume static preference weights or a known scalar reward. In this work, we study sequential decision-making problems in which these preference weights are unobserved latent variables that drift with context. Specifically, we propose Dynamic Preference Inference (DPI), a cognitively inspired framework in which an agent maintains a probabilistic belief over preference weights, updates this belief from recent interaction, and conditions its policy on inferred preferences. We instantiate DPI as a variational preference inference module trained jointly with a preference-conditioned actor-critic, using vector-valued returns as evidence about latent…
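The belief-update-then-act loop described in the abstract can be illustrated with a toy sketch. This is not the paper's variational module or its actor-critic: it swaps in a grid-based Bayesian filter with exponential forgetting, assumes the evidence is which of several vector-valued returns was preferred at each step, and assumes a softmax choice likelihood. All names and parameters (`CANDIDATES`, `update_belief`, `beta`, `forget`, `run_demo`) are hypothetical.

```python
import math

# Hypothetical candidate preference vectors over two objectives
# (a coarse grid on the 1-simplex). Not taken from the paper.
CANDIDATES = [(i / 10, 1 - i / 10) for i in range(11)]


def scalarize(w, r):
    """Linear scalarization: weighted sum of a vector-valued return."""
    return sum(wi * ri for wi, ri in zip(w, r))


def update_belief(belief, options, chosen, beta=5.0, forget=0.9):
    """One Bayesian update with exponential forgetting (assumed model).

    belief  : probabilities over CANDIDATES
    options : vector-valued returns available at this step
    chosen  : index of the option that was preferred (the evidence)
    beta    : softmax sharpness (assumption)
    forget  : exponent < 1 that flattens the old belief so it can track drift
    """
    posterior = []
    for p, w in zip(belief, CANDIDATES):
        scores = [beta * scalarize(w, r) for r in options]
        m = max(scores)  # subtract the max for numerical stability
        z = sum(math.exp(s - m) for s in scores)
        likelihood = math.exp(scores[chosen] - m) / z
        posterior.append((p ** forget) * likelihood)
    total = sum(posterior)
    return [p / total for p in posterior]


def belief_mean(belief):
    """Posterior mean of the preference weights."""
    return tuple(
        sum(p * w[k] for p, w in zip(belief, CANDIDATES)) for k in range(2)
    )


def act(belief, options):
    """Condition the decision on the inferred preference (posterior mean)."""
    w_hat = belief_mean(belief)
    return max(range(len(options)), key=lambda i: scalarize(w_hat, options[i]))


def run_demo():
    """Preference shifts from objective 0 to objective 1 halfway through."""
    options = [(1.0, 0.0), (0.0, 1.0)]
    belief = [1 / len(CANDIDATES)] * len(CANDIDATES)
    for _ in range(20):  # regime 1: objective 0 is preferred
        belief = update_belief(belief, options, chosen=0)
    before = belief_mean(belief)
    for _ in range(20):  # regime 2: objective 1 is preferred
        belief = update_belief(belief, options, chosen=1)
    after = belief_mean(belief)
    return before, after, act(belief, options)
```

Calling `run_demo()` returns the inferred weights before and after the shift plus the action chosen at the end. The forgetting exponent is the design choice that lets the belief move after a regime change; with `forget=1.0` the filter reduces to a standard static Bayesian update, which would keep weighting pre-shift evidence equally.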

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 2

Strengths

The paper considers an interesting under-explored area. There is strong empirical evidence that humans switch their stated goals over time. There is value in understanding this phenomenon and in designing algorithms that support humans despite these changes.

Weaknesses

The motivation for the paper is unclear (see questions #1, #2, #3 below). It is also unclear how to apply it to a practical problem (questions #1, #2, #4). The experiments should have included stronger RL-based baselines that maximize the performance metrics reported (question #5).

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The paper identifies an interesting problem that the machine learning literature may not be fully aware of. The problem has connections with reinforcement learning and MDPs.
- The proposed framework is quite interesting, and the Bayesian view of this problem is natural to follow. I like how the value shift is integrated in the framework, and I also feel that using the sample example for explaining the framework helps a lot.

Weaknesses

- The paper starts with a lot of terminology but lacks an explanation of these terms. Although I'm quite familiar with the choice model/psychological literature, it is hard to tell if these terms are made up/invented by the authors. I strongly suggest avoiding overuse or misuse of terminology and approaching from an easy-to-understand tone. Besides, proper references for the well-defined terms are needed.
  -- I pointed out some points in the Questions section, but there are many more confusing po

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- Novel Problem Formulation: The paper identifies and formalizes a significant gap in current MORL research: treating preference weights as latent and dynamic states that must be inferred online, rather than static inputs. This is well-motivated by cognitive theories of human decision-making.
- Principled Framework: The DPI method offers a principled probabilistic approach, synthesizing variational inference for preference estimation (using ELBO) with established preference-conditioned RL techn

Weaknesses

- Limited Experimental Scope: The evaluation is restricted to two relatively simple synthetic environments: a symbolic queue task and a 2D grid-world. While illustrative, these do not fully demonstrate the method's scalability to complex, high-dimensional, or continuous control problems often seen in practical MORL settings.
- Potentially Weak Oracle Baseline: The ENVELOPE baseline, described as having "oracle access to event-dependent preference weights", performs surprisingly poorly (e.g., o

Taxonomy

Topics: Decision-Making and Behavioral Economics · Embodied and Extended Cognition · Reinforcement Learning in Robotics