A Markov Decision Process for Variable Selection in Branch & Bound
Paul Strang, Zacharie Alès, Côme Bissuel, Olivier Juan, Safia Kedad-Sidhoum, Emmanuel Rachelson

TL;DR
This paper introduces BBMDP, a Markov Decision Process framework for variable selection in Branch & Bound (B&B) that lets standard reinforcement learning algorithms learn branching heuristics for MILP solving. Empirically, the resulting branching agent outperforms prior RL agents on four MILP benchmarks.
Contribution
The paper presents a novel MDP formulation for variable selection in B&B, facilitating the application of RL algorithms to learn better heuristics for MILP problems.
Findings
The BBMDP model outperforms previous RL agents on four MILP benchmarks.
Empirical results demonstrate improved B&B efficiency using the proposed MDP approach.
The formulation enables leveraging a broad range of RL algorithms for B&B heuristics.
Abstract
Mixed-Integer Linear Programming (MILP) is a powerful framework used to address a wide range of NP-hard combinatorial optimization problems, often solved by Branch and Bound (B&B). A key factor influencing the performance of B&B solvers is the variable selection heuristic governing branching decisions. Recent contributions have sought to adapt reinforcement learning (RL) algorithms to the B&B setting to learn optimal branching policies, through formulations inspired by Markov Decision Processes (MDP) together with ad hoc convergence theorems and algorithms. In this work, we introduce BBMDP, a principled vanilla MDP formulation for variable selection in B&B, which makes it possible to leverage a broad range of RL algorithms for the purpose of learning optimal B&B heuristics. Computational experiments validate our model empirically, as our branching agent outperforms prior state-of-the-art RL agents on four…
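To make the role of the variable selection heuristic concrete, the following is a minimal sketch of B&B on a toy 0-1 knapsack problem in which the branching variable is chosen by a pluggable policy. It is not the BBMDP formulation from the paper: the function names (`lp_relaxation`, `branch_and_bound`, `most_fractional`, `first_free`), the knapsack instance, and the use of the explored-node count as the cost a learned policy would try to reduce are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Fixing = Dict[int, int]  # variable index -> value it is fixed to (0 or 1)
Policy = Callable[[List[float], List[int]], int]


def lp_relaxation(values: List[float], weights: List[float], capacity: float,
                  fixed: Fixing) -> Optional[Tuple[float, List[float]]]:
    """LP relaxation of a 0-1 knapsack under partial fixings.

    Solved greedily by value/weight ratio (weights are assumed positive).
    Returns (upper bound, relaxed solution), or None if the node is infeasible.
    """
    x = [0.0] * len(values)
    remaining, total = capacity, 0.0
    for i, v in fixed.items():
        if v == 1:
            x[i] = 1.0
            remaining -= weights[i]
            total += values[i]
    if remaining < 0:
        return None
    free = [i for i in range(len(values)) if i not in fixed]
    free.sort(key=lambda i: values[i] / weights[i], reverse=True)
    for i in free:
        if weights[i] <= remaining:
            x[i] = 1.0
            remaining -= weights[i]
            total += values[i]
        else:
            # At most one variable ends up fractional in a knapsack LP.
            x[i] = remaining / weights[i]
            total += values[i] * x[i]
            break
    return total, x


def branch_and_bound(values: List[float], weights: List[float], capacity: float,
                     select_variable: Policy) -> Tuple[float, int]:
    """Depth-first B&B; returns (best objective value, number of nodes explored)."""
    best_value, nodes_explored = 0.0, 0
    stack: List[Fixing] = [{}]
    while stack:
        fixed = stack.pop()
        nodes_explored += 1
        relaxed = lp_relaxation(values, weights, capacity, fixed)
        if relaxed is None:
            continue  # infeasible node
        bound, x = relaxed
        if bound <= best_value + 1e-9:
            continue  # pruned: cannot improve on the incumbent
        if all(xi < 1e-9 or xi > 1 - 1e-9 for xi in x):
            best_value = bound  # integral LP solution becomes the incumbent
            continue
        free = [i for i in range(len(values)) if i not in fixed]
        j = select_variable(x, free)  # the branching decision
        stack.append({**fixed, j: 0})
        stack.append({**fixed, j: 1})
    return best_value, nodes_explored


def most_fractional(x: List[float], free: List[int]) -> int:
    """Branch on the free variable whose relaxed value is closest to 0.5."""
    return min(free, key=lambda i: abs(x[i] - 0.5))


def first_free(x: List[float], free: List[int]) -> int:
    """Naive baseline: branch on the lowest-index free variable."""
    return free[0]


if __name__ == "__main__":
    values, weights, capacity = [60.0, 100.0, 120.0], [10.0, 20.0, 30.0], 50.0
    for policy in (most_fractional, first_free):
        best, nodes = branch_and_bound(values, weights, capacity, policy)
        print(policy.__name__, best, nodes)
```

Both policies reach the same optimal value, but they generally explore a different number of nodes. In this sketch, that difference is the quantity a learned branching policy would aim to reduce.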
Peer Reviews
Decision: NeurIPS 2025 poster
**Strengths** - The paper studies an interesting and important problem. B&B algorithms are widely used for combinatorial optimization, and any improvement here is likely to be of significant interest and have a large impact. - The paper is well written. The problem formulation is clearly described. The description of prior work is very good. - The proposed approach is intuitively clear and mathematically rigorous. The optimization problem is well defined and the reformulation to stochastic sh
**Strength**: * Enables the use of RL frameworks within B&B solvers. * Models the B&B process as a temporal decision-making problem, allowing for the incorporation of time-dependent strategies (e.g., node selection with early stop). * BBMDP demonstrates strong empirical performance, outperforming other RL-based approaches in benchmark tests. **Weakness**: 1. The presentation of the work is hard to follow when a nomenclature is not present. Since there are symbols with both B&B and RL their def
Attempts to use RL for variable selection in MILP branch-and-bound solving have not yet reached their potential. The paper argues well that, in principle, RL can surpass supervised learning, namely imitation learning from strong branching. A difficulty has been formulating an MDP for the RL task that has the Markov property. Etheve et al. in 2020 made a not entirely satisfactory effort, which was formalised by Scavuzzo et al. in 2022 as the TreeMDP, accompanied by a careful empirical study. Thi
Strengths: A Principled Formulation. The paper’s primary strength is the introduction of the “Branch & Bound Markov Decision Process” (BBMDP). This is a significant contribution, advancing the field from “MDP-inspired” but technically flawed frameworks (e.g., TreeMDP) to a principled, standard MDP formulation that is theoretically sound. Insightful Analysis. The paper provides a clear and insightful diagnosis of a core issue with prior approaches in Section 3.2, “Misconceptions in Learning to Ex
Taxonomy
Topics: Reinforcement Learning in Robotics · Constraint Satisfaction and Optimization · Vehicle Routing Optimization Methods
