FrontierCO: Real-World and Large-Scale Evaluation of Machine Learning Solvers for Combinatorial Optimization
Shengyu Feng, Weiwei Sun, Shanda Li, Ameet Talwalkar, Yiming Yang

TL;DR
FrontierCO introduces a comprehensive benchmark for evaluating machine learning methods on large-scale, real-world combinatorial optimization problems, revealing performance gaps relative to classical solvers as well as specific cases where ML methods hold an advantage.
Contribution
It provides a large-scale, real-world dataset and standardized evaluation framework for ML-based combinatorial optimization, addressing limitations of synthetic benchmarks.
Findings
ML methods generally trail classical solvers on large, complex real-world instances.
Performance gaps widen with instance size and complexity.
Some ML approaches outperform classical solvers in specific cases.
Abstract
Machine learning (ML) has shown promise for tackling combinatorial optimization (CO), but much of the reported progress relies on small-scale, synthetic benchmarks that fail to capture real-world structure and scale. A core limitation is that ML methods are typically trained and evaluated on synthetic instance generators, leaving open how they perform on irregular, competition-grade, or industrial datasets. We present FrontierCO, a benchmark for evaluating ML-based CO solvers under real-world structure and extreme scale. FrontierCO spans eight CO problems, including routing, scheduling, facility location, and graph problems, with instances drawn from competitions and public repositories (e.g., DIMACS, TSPLib). Each task provides both easy sets (historically challenging but now solvable) and hard sets (open or computationally intensive), alongside standardized training/validation…
Peer Reviews
Decision: ICLR 2026 Poster
[Problem motivation & clarity] The introduction tightly argues why synthetic, small-scale evaluations have over-estimated ML performance and motivates a real-world, frontier-scale benchmark. The easy/hard split, data provenance, and scale claims are clearly articulated, making the problem and goals unambiguous. [Benchmark design & practicality] The benchmark aggregates eight diverse CO tasks with real-world test instances and provides standardized BKS and synthetic training/dev resource
1. The paper’s treatment of metric edge cases lacks technical clarity. While the primal-gap policy is defined—including handling of negative or zero costs and infeasible outputs—there are few concrete examples that cover both minimization and maximization settings. This matters because tasks can differ in objective sign and scale, and without worked examples readers may interpret the same numeric gap differently across problems. The gap definition section would benefit from illustrative cases; a
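The edge cases this review raises can be illustrated with a minimal sketch. It assumes the widely used normalized primal gap, |c - c*| / max(|c|, |c*|), and is not confirmed to match the paper's exact policy. The function name `primal_gap`, the use of `None` for an infeasible output, and the choice to assign infeasible or sign-mismatched results the worst gap of 1 are all hypothetical choices made for illustration.

```python
import math
from typing import Optional


def primal_gap(cost: Optional[float], best_known: float) -> float:
    """Return a gap in [0, 1]; 0 means the best-known value was matched.

    Assumed convention (not taken from the paper): the normalized gap
    |cost - best_known| / max(|cost|, |best_known|).
    """
    # Assumption: an infeasible or missing solution receives the worst gap.
    if cost is None or math.isnan(cost) or math.isinf(cost):
        return 1.0
    # Both objective values are zero: the solution matches the reference.
    if cost == 0 and best_known == 0:
        return 0.0
    # Opposite signs: a ratio is not meaningful, so assign the worst gap.
    if cost * best_known < 0:
        return 1.0
    # Same sign (or exactly one value is zero): relative distance,
    # normalized by the larger magnitude so the result stays in [0, 1].
    return abs(cost - best_known) / max(abs(cost), abs(best_known))


if __name__ == "__main__":
    print(primal_gap(110.0, 100.0))   # minimization: solver cost above best known
    print(primal_gap(90.0, 100.0))    # maximization: solver value below best known
    print(primal_gap(-90.0, -100.0))  # negative objective values
    print(primal_gap(None, 100.0))    # infeasible output
```

Under this assumed definition, a minimization result of 110 against a best-known 100 gives 10/110 ≈ 0.091, while a maximization result of 90 against 100 gives 10/100 = 0.1. The absolute value makes the formula direction-agnostic, which also means it cannot by itself distinguish a solution that beats the best-known value from one that falls short of it.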
1. The methodology of the paper is good and well presented. 2. Developing standardized benchmarks for evaluating ML-based CO solvers is necessary for progress in this field. 3. The proposed benchmark covers routing, graph, location, set, and scheduling CO problems. However, another scheduling problem would be welcome.
I believe that some of the claims in this paper are too strong and not supported by the current state of research in this domain. It is true, in general, that many neural solvers rely on attention mechanisms and suffer from well-known attention bottlenecks, and this paper points out that addressing these limitations is an important direction for future research. However, many works have already explored ways to mitigate these bottlenecks, which are entirely overlooked in this work. Some relev
A large-scale, broad benchmark makes a lot of sense for the ML4CO field. The selection of problems is reasonable. The best-known solutions are a good guide for CO practitioners. And the findings about current ML methods vs. solvers are important guidelines for future research.
First of all, there is no clear weakness in the paper. I have some small concerns: - The problem instances in Table 1 are all collected from previous literature or competitions, which might limit the originality of the benchmark. - The standardized training and validation makes sense to some extent, but it may not be a fair comparison, e.g., some methods may subsume scaling laws and perform better with more training data, while other methods might be suitable for data scarcity but
Taxonomy
Topics: Constraint Satisfaction and Optimization · Vehicle Routing Optimization Methods · Graph Theory and Algorithms
