# Lagrangian Relaxation for Multi-Action Partially Observable Restless Bandits: Heuristic Policies and Indexability

**Authors:** Rahul Meshram, Kesav Kaza

arXiv: 2509.00415 · 2025-09-03

## TL;DR

This paper studies multi-action partially observable restless bandits under a budget constraint. It analyzes a Lagrangian relaxation bound, approximates that bound with point-based value iteration (PBVI) and an online rollout policy, studies heuristic policies, and discusses the limitations of Whittle index policies in this model.

## Contribution

The paper analyzes the Lagrangian bound method for multi-action partially observable restless bandits, describes PBVI and online rollout approximations for computing the bound, and studies heuristic policies and the limitations of Whittle index policies in this setting.

## Key findings

- Lagrangian bounds can be approximated using PBVI and rollout policies.
- Heuristic policies show promising performance in complex multi-action POMDPs.
- Whittle index policies have limitations in the studied model.
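
As a rough illustration of what a Lagrangian bound looks like for a weakly coupled problem of this kind, the standard form is sketched below. The notation is an assumption made for this summary and may differ from the paper's: $N$ bandits with beliefs $b_n$, a per-step budget $B$, an action cost $c(a)$, a discount factor $\beta$, a belief-based reward $r_n(b_n, a)$, a Bayes belief update $\tau(b_n, a, o)$, and a multiplier $\lambda \ge 0$.

$$
V(b_1, \dots, b_N) \;\le\; \min_{\lambda \ge 0} \left[ \frac{\lambda B}{1 - \beta} + \sum_{n=1}^{N} V_n^{\lambda}(b_n) \right]
$$

$$
V_n^{\lambda}(b_n) = \max_{a} \left[ r_n(b_n, a) - \lambda\, c(a) + \beta \sum_{o} P(o \mid b_n, a)\, V_n^{\lambda}\big(\tau(b_n, a, o)\big) \right]
$$

Each $V_n^{\lambda}$ is the value function of a single-bandit POMDP, so relaxing the budget constraint decouples the bandits. Replacing $V_n^{\lambda}$ with a PBVI or rollout estimate then gives an approximation to the bound rather than the bound itself.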

## Abstract

Partially observable restless multi-armed bandits have found numerous applications, including recommendation systems, communication systems, public healthcare outreach systems, and operations research. We study multi-action partially observable restless multi-armed bandits (PORMAB), a generalization of the classical restless multi-armed bandit problem in which 1) each bandit has finitely many states and the current state is not observable, and 2) each bandit has finitely many actions. In particular, we assume that more than two actions are available for each bandit. We motivate our problem with the application of public-health intervention planning. We describe the model and formulate a long-term discounted optimization problem, where the state of each bandit evolves according to a Markov process, and this evolution is action dependent. The state of a bandit is not observable, but one of finitely many feedback signals is observed. Each bandit yields a reward based on the action taken on that bandit. The agent is assumed to have a budget constraint. The bandits are assumed to be independent; however, they are weakly coupled at the agent through the budget constraint.

We first analyze the Lagrangian bound method for our partially observable restless bandits. The computation of optimal value functions for finite-state, finite-action POMDPs is non-trivial. Hence, the computation of Lagrangian bounds is also challenging. We describe approximations for the computation of Lagrangian bounds using point-based value iteration (PBVI) and an online rollout policy. We further present various properties of the value functions and provide theoretical insights on PBVI and the online rollout policy. We study heuristic policies for multi-action PORMAB. Finally, we discuss Whittle index policies and their limitations in our model.
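
Because the state of each bandit is hidden, a decision-maker in this kind of model typically tracks a belief (a probability vector over states) per bandit and updates it from the chosen action and the observed feedback signal. The following is a minimal sketch of that standard Bayes update, not code from the paper. The function name `belief_update`, the table layouts, and the example numbers are all hypothetical.

```python
def belief_update(belief, transition, observation, action, signal):
    """Bayes update of one bandit's belief after `action` yields `signal`.

    Assumed (hypothetical) table layouts:
      transition[a][s][s2]  = P(next state s2 | state s, action a)
      observation[a][s2][o] = P(signal o | next state s2, action a)
    """
    n = len(belief)
    # Predict: push the belief through the action-dependent transition.
    predicted = [
        sum(belief[s] * transition[action][s][s2] for s in range(n))
        for s2 in range(n)
    ]
    # Correct: weight each next state by the likelihood of the signal.
    unnormalized = [
        predicted[s2] * observation[action][s2][signal] for s2 in range(n)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("signal has zero probability under this belief and action")
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    # Illustrative two-state bandit with three actions (made-up numbers).
    transition = [
        [[0.9, 0.1], [0.3, 0.7]],  # action 0
        [[0.8, 0.2], [0.5, 0.5]],  # action 1
        [[0.6, 0.4], [0.9, 0.1]],  # action 2
    ]
    observation = [
        [[0.5, 0.5], [0.5, 0.5]],  # action 0: uninformative signal
        [[0.7, 0.3], [0.4, 0.6]],  # action 1
        [[0.9, 0.1], [0.2, 0.8]],  # action 2: most informative signal
    ]
    print(belief_update([0.5, 0.5], transition, observation, action=2, signal=0))
```

With more than two actions, each action can change both how the state evolves and how informative the feedback is, which is why the tables are indexed by action first.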

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00415/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/2509.00415/full.md

---
Source: https://tomesphere.com/paper/2509.00415