TreeDQN: Sample-Efficient Off-Policy Reinforcement Learning for Combinatorial Optimization

D. Sorokin; A. Kostin; L. Savchenko; G. Gusev; A.V. Savchenko

arXiv:2306.05905 · cs.LG · May 22, 2026 · 1 citation

TL;DR

TreeDQN is a sample-efficient off-policy reinforcement learning method for combinatorial optimization that reduces training time and outperforms existing techniques on synthetic and practical tasks.

Contribution

It introduces TreeDQN, a novel off-policy RL approach with theoretical guarantees, improving training efficiency and performance over on-policy methods.

Findings

01. TreeDQN requires up to 10 times less training data.

02. It trains faster than on-policy methods.

03. It outperforms state-of-the-art techniques on ML4CO competition tasks.

Abstract

A convenient approach to optimally solving combinatorial optimization tasks is the Branch-and-Bound method. Its branching heuristic can be learned to solve a large set of similar tasks. The promising results here are achieved by the recently appeared on-policy reinforcement learning method based on the tree Markov Decision Process. To overcome its main disadvantages, namely, very large training time and unstable training, we propose TreeDQN (Tree Deep Q-Network), a sample-efficient off-policy RL method trained by optimizing the geometric mean of expected return. To theoretically support the training procedure for our method, we prove the contraction property of the Bellman operator for the tree MDP. As a result, our method requires up to 10 times less training data and performs faster than known on-policy methods on synthetic tasks. Moreover, TreeDQN significantly outperforms the…
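To make the tree-structured Bellman backup mentioned in the abstract more concrete, here is a minimal tabular sketch. It rests on assumptions that are not stated on this page: each branching action produces child nodes whose values are summed, and every expanded node earns a reward of -1, so that the negated value of a node equals the size of the best subtree below it. The toy problem, the `TreeMDP` type alias, and the function names are hypothetical illustrations, not the authors' implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical toy tree MDP: state -> action -> (reward, list of child states).
# States with no actions are leaves. The -1 reward per expanded node is an
# assumption made for illustration, so that -V(state) counts expanded nodes.
TreeMDP = Dict[str, Dict[str, Tuple[float, List[str]]]]

toy: TreeMDP = {
    "root": {"x1": (-1.0, ["a", "b"]), "x2": (-1.0, ["c", "leaf"])},
    "a": {"x2": (-1.0, ["leaf", "leaf"])},
    "b": {"x2": (-1.0, ["leaf", "leaf"])},
    "c": {"x1": (-1.0, ["leaf", "leaf"])},
    "leaf": {},
}


def tree_backup(mdp: TreeMDP, values: Dict[str, float]) -> Dict[str, float]:
    """Apply one sweep of a tree-style Bellman backup.

    Unlike a chain MDP, the value of a state adds up the values of ALL
    children created by the chosen action.
    """
    updated = {}
    for state, actions in mdp.items():
        if not actions:  # leaf: nothing left to expand
            updated[state] = 0.0
            continue
        updated[state] = max(
            reward + sum(values[child] for child in children)
            for reward, children in actions.values()
        )
    return updated


def solve(mdp: TreeMDP, sweeps: int = 10) -> Dict[str, float]:
    """Iterate the backup from all-zero values for a fixed number of sweeps."""
    values = {state: 0.0 for state in mdp}
    for _ in range(sweeps):
        values = tree_backup(mdp, values)
    return values


values = solve(toy)
print(values["root"])  # -2.0: branching on x2 first expands only two nodes
```

On this finite toy tree the iteration settles after as many sweeps as the tree is deep. The sketch only illustrates the shape of the backup; it says nothing about the conditions under which the paper proves contraction.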

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

The paper presents an algorithm for solving Tree MDPs with the specific application to learning branching heuristics for branch and bound algorithms in the context of solving mixed integer linear programming problems. TreeDQN presents better results on some of the benchmark problems used in the paper.

Weaknesses

The presentation is *possibly* the paper's weakest point. The lack of clarity makes me wonder about the value of the contributions of the paper. The main contribution of the paper, TreeDQN, is explained in a single paragraph in the main text. Since the text only states that the algorithm is an adaptation of Double Dueling DQN, I assume TreeDQN is a straightforward adaptation of DQN to Tree MDPs. The paper builds on a couple of previous papers, which I had to skim over in order to un

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 3

Strengths

Branching is a critical aspect of integer programming solvers, and the authors provide an interesting new contribution towards RL based methods for the design of branching rules. The new methods are shown to produce smaller branch-and-bound trees than previous RL based variable selection methods, making this work a promising advance in the “learning to branch” line of work.

Weaknesses

Section 2.2 “Tree MDP” needs way more explanation. It more or less assumes familiarity with the Tree MDP work of Scavuzzo et al., and a more self-contained exposition would be very helpful. The theoretical contribution is very hazy to me. Contraction in mean is not really well-motivated. Does the cited theorem (Jaakkola ’93) apply to the setting of tree operators here? That seems like a nontrivial assumption that is missing justification. Rather than just including a theorem about contraction i

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

Inherently, the idea of modelling variable selection as a Tree-MDP is a great idea as it allows the incorporation of the branch-and-bound structure into the decision process. The modification of the loss function to stably regress towards the geometric mean is also clever and might prove useful even outside the learnt variable selection domain. In general, the presentation of the work is clean and easy to read.
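The remark about regressing towards the geometric mean can be illustrated with a small numeric sketch. It assumes one common way to obtain that behaviour, namely minimising squared error on log-transformed targets; this is not necessarily the exact loss used in the paper. The tree sizes and the helper `fit_constant_mse` are made up for illustration.

```python
import math
import statistics


def fit_constant_mse(targets, steps=2000, lr=0.1):
    """Fit a single constant prediction by gradient descent on mean squared error."""
    pred = 0.0
    for _ in range(steps):
        grad = sum(2.0 * (pred - t) for t in targets) / len(targets)
        pred -= lr * grad
    return pred


# Hypothetical heavy-tailed tree sizes: one hard instance dominates the rest.
tree_sizes = [10.0, 12.0, 15.0, 2000.0]

# Plain MSE converges to the arithmetic mean, which the outlier pulls upward.
raw_fit = fit_constant_mse(tree_sizes)

# MSE on log targets converges to the mean of the logs; exponentiating that
# gives the geometric mean, which is far less sensitive to the outlier.
log_fit = math.exp(fit_constant_mse([math.log(s) for s in tree_sizes]))

print(raw_fit, statistics.fmean(tree_sizes))
print(log_fit, statistics.geometric_mean(tree_sizes))
```

The design point is that branch-and-bound tree sizes tend to span orders of magnitude, so a target that tracks the geometric mean keeps a few very large trees from dominating the regression.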

Weaknesses

1. Perhaps the biggest limitation is the assumption that the upper bound has to be derivable from the current node or known ahead of time. The authors assert that this (as well as more intricate node selection policies) leads to at most a moderate distribution shift, but never demonstrate this effect.

2. Another concern is regarding the difficulty distribution of instances. Random instance generation has been known to generate significant amounts of trivial instances compared to real-world equiva

Code & Models

Repositories

dmitrysorokin/treedqn (PyTorch, official)

Taxonomy

Topics: Reinforcement Learning in Robotics · Scheduling and Optimization Algorithms · Auction Theory and Applications