How hard is learning to cut? Trade-offs and sample complexity
Sammy Khalife, Andrea Lodi

TL;DR
This paper establishes fundamental lower bounds on the sample complexity for learning to select effective cuts in branch-and-cut algorithms, providing theoretical insights and empirical validation for the use of gap closed scores.
Contribution
It introduces the first lower bounds for the learning-to-cut framework, analyzing both scores and comparing them to upper bounds, with practical experiments on neural network-based cut selection.
Findings
Lower bounds match known upper bounds for neural networks
Gap closed score effectively reduces branch-and-cut tree size
Theoretical analysis applies to cuts from Simplex tableau
Abstract
In recent years, branch-and-cut algorithms have been the target of data-driven approaches designed to enhance decision making in different phases of the algorithm, such as branching or the choice of cutting planes (cuts). In particular, for cutting plane selection, two score functions have been proposed in the literature to evaluate the quality of a cut: branch-and-cut tree size and gap closed. In this paper, we present new sample complexity lower bounds, valid for both scores. We show that for a wide family of classes that map an instance to a cut, learning over an unknown distribution of the instances to minimize those scores requires at least (up to multiplicative constants) as many samples as learning any generic target function from the same function class (using square loss). Our results also extend to the case of learning from a restricted set…
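As a rough illustration of the gap closed score named in the abstract, the sketch below uses the definition commonly found in the cutting-plane literature: the fraction of the integrality gap recovered by re-solving the LP relaxation after adding a cut. The function name `gap_closed`, the tolerance `tol`, and the convention of returning 0 when there is no gap are assumptions made for this example; the paper's exact definition may differ.

```python
def gap_closed(z_lp: float, z_cut: float, z_ip: float, tol: float = 1e-9) -> float:
    """Fraction of the integrality gap closed by a cut (hypothetical helper).

    z_lp  : optimal value of the LP relaxation before the cut
    z_cut : optimal value of the LP relaxation after adding the cut
    z_ip  : optimal value of the integer program
    """
    gap = z_ip - z_lp
    if abs(gap) <= tol:
        # No integrality gap to close; returning 0.0 is a convention assumed here.
        return 0.0
    # The ratio is sign-invariant, so it works for minimization and maximization.
    return (z_cut - z_lp) / gap


# Pick the candidate cut with the largest gap closed (toy numbers, not from the paper).
candidates = {"cut_a": 11.0, "cut_b": 12.5, "cut_c": 10.2}
best = max(candidates, key=lambda name: gap_closed(10.0, candidates[name], 15.0))
```

A score of 0 means the cut leaves the LP bound unchanged, and a score of 1 means the LP bound reaches the integer optimum. Unlike tree size, this score needs only one extra LP solve per candidate cut, which is why it is attractive as a proxy.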
Peer Reviews
Decision: ICLR 2026 Poster
- The paper is well written: it illustrates its core ideas through intuitive examples and explanations without sacrificing technical details.
- The bounds derived don't seem to be vacuous/uninformative. The proofs are also relying on (as far as I can tell) novel constructions that are specific to the problem.
- To the best of my knowledge, these are the best known nontrivial bounds for this learning problem.
- The paper also provides an experimental comparison of the learned cuts of a GNN compar
- In the proof of 3.2 (Around line 600), you set $a = \epsilon$ and use the Transfer Lemma. Doesn't that mean you take $\text{fat}_{\mathcal{F}_{s,\sigma'}}\left(\frac{\epsilon}{\epsilon}\right) \geq \text{VCdim}(\mathcal{F}[n])$? However the transfer lemma holds for $\gamma \in (0,1/2)$ so I'm not sure you want to do it this way. Maybe set $a = c\epsilon$ for $c$ at least 2?
- How necessary is assumption 2 for your result? Are there any architectures that don't qualify? That's perhaps one part I'm a li
1. First sample complexity lower bounds for learning-to-cut, establishing theoretical foundations
2. Lower bounds hold for wide function classes and are nearly tight with upper bounds up to logarithmic factors
3. Theoretical equivalence between scores provides formal justification for using gap closed as proxy
4. Proof construction for shattering ILP instances is non-trivial and well-executed
5. Experiments directly test the core thesis about proxy effectiveness
1. Gap between $\Omega(1/\epsilon)$ lower bound and $O(1/\epsilon^2)$ upper bound for tableau case
2. Assumptions 1 and 2 restrict generality of theoretical claims
3. Empirical validation uses small-scale problems in controlled environment and restricted solver configuration
4. Performance varies: on Facility Location, Efficacy heuristic (123.63) outperforms GNN (134.61)
5. Experimental setup disables key solver components (presolve, heuristics, default cuts)
6. Theoretical analysis confine
The problem of learning cut selection policies for the branch-and-cut framework is interesting and very practical, and this paper provides the first lower bounds on sample complexity for this problem. The finding that the gap-reduction metric is a suitable approximation to tree size is also useful. Overall I found the paper well written and feel like the main contributions were clearly communicated.
I would have liked for more of the key sample complexity lower bound argument to be sketched in the main body of the paper. For example, in the discussion of Proposition 3.4, it is stated that the approach ignores $n \times m + m$ of the inputs, and I was curious to understand why. There are also a number of minor typos throughout the paper, and a few significant ones in the experimental results (unless I have misunderstood something). I think at the beginning of Section 2.2, the authors could
Taxonomy
Topics: Complexity and Algorithms in Graphs · Vehicle Routing Optimization Methods · Stochastic Gradient Optimization Techniques
Methods: Graph Neural Network · Sparse Evolutionary Training
