FORGE: Foundational Optimization Representations from Graph Embeddings

Zohair Shafi; Serdar Kadioglu

arXiv:2508.20330·cs.LG·September 25, 2025

Open Access 3 Reviews

TL;DR

Forge introduces a pre-trained graph autoencoder that creates versatile, unsupervised embeddings for combinatorial optimization problems, improving solver performance and generalization across diverse problem types and sizes.

Contribution

A novel unsupervised pre-training framework for optimization instance representations that enhances solver guidance and generalizes across multiple problem distributions.

Findings

1. Forge embeddings cluster unseen instances effectively.
2. Pre-trained Forge improves solver performance in supervised tasks.
3. Outperforms state-of-the-art learning methods in optimization tasks.

Abstract

Combinatorial optimization problems are ubiquitous in science and engineering. Still, learning-based approaches to accelerate combinatorial optimization often require solving a large number of difficult instances to collect training data, incurring significant computational cost. Existing learning-based methods require training dedicated models for each problem distribution, for each downstream task, severely limiting their scalability and generalization. We introduce Forge: Foundational Optimization Representations from Graph Embeddings, a framework that pre-trains a vector-quantized graph autoencoder on a large, diverse collection of mixed-integer programming (MIP) instances in an unsupervised manner, without relying on optimization solvers or optimal solutions. Vector quantization produces discrete code assignments that serve as a vocabulary for representing optimization instances.…

Tables

Table 1: Ablation study on codebook size of the Forge architecture, measuring NMI scores.

| Codebook Size | NMI |
| --- | --- |
| 500 | $0.810 \pm 0.027$ |
| 1,000 | $0.818 \pm 0.030$ |
| 2,500 | $0.822 \pm 0.026$ |
| 5,000 | $0.843 \pm 0.031$ |
| 10,000 | $0.805 \pm 0.022$ |

Equations

$\mathcal{L}=\mathcal{L}_{Rec}+\mathcal{L}_{Codebook}+\mathcal{L}_{Commitment}$

$\mathcal{L}_{Rec}=(A-\hat{X}\hat{X^{T}})^{2}+\frac{1}{N}\sum_{i=1}^{N}(\hat{v_{i}}-v_{i})^{2}$

$\mathcal{L}_{Codebook}=\frac{1}{N}\sum_{i=1}^{N}\|sg[h_{i}]-cw_{i}\|_{2}^{2}$

$\mathcal{L}_{Commitment}=\frac{\alpha}{N}\sum_{i=1}^{N}\|sg[cw_{i}]-h_{i}\|_{2}^{2}$

$\min\sum_{S\in\mathcal{S}}x_{S}$

$\sum_{S:e\in S}x_{S}\geq 1\ \forall e\in U$

$x_{S}\in\{0,1\}\ \forall S\in\mathcal{S}$

$\min\sum_{v\in V}x_{v}$

$x_{u}+x_{v}\geq 1\ \forall(u,v)\in E$

$x_{v}\in\{0,1\}\ \forall v\in V$

$\min\sum_{j=1}^{n}y_{j}$

$\sum_{j=1}^{n}x_{ij}=1\ \forall i\in\{1,\dots,m\}$

$\sum_{i=1}^{m}w_{i}x_{ij}\leq C\cdot y_{j}\ \forall j\in\{1,\dots,n\}$

$x_{ij}\in\{0,1\},\ y_{j}\in\{0,1\}$

$\max\sum_{v\in V}x_{v}$

$x_{u}+x_{v}\leq 1\ \forall(u,v)\in E$

$x_{v}\in\{0,1\}\ \forall v\in V$

$\mu_{sc}$

$\mu_{bp}$

$d_{sc-bp}$

$e_{updated\_mvc}$

$L(a,p,n)=\max\{d(a_{i},p_{i})-d(a_{i},n_{i})+margin,\ 0\}$

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 5

Strengths

1. FORGE leverages unsupervised learning to pre-train embeddings from a wide range of MIP instances, creating general-purpose representations that can be used across multiple optimization tasks and problem sizes without needing optimal solutions or labeled data. This broadens its applicability compared to other supervised or task-specific models. 2. The FORGE embeddings show strong generalization capabilities across different problem domains and sizes. The method not only works for one specific

Weaknesses

1. Although FORGE performs well, the interpretability of the learned embeddings remains unexplored. The paper briefly mentions that some codes may represent local structures like cliques, but a deeper understanding of how different parts of the embedding space correspond to specific problem features would be valuable. 2. While the model is compact (with 3.25 million parameters), training on very large datasets might still pose challenges in terms of computational resources. The ability to scale

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper is pioneering in its attempt to build a “foundational model” for optimization, similar to those in natural language processing (NLP) and computer vision (CV). By combining bipartite graph structures with a vector-quantized autoencoder, it achieves general structural encoding for optimization instances. 2. The model’s training does not depend on optimal solutions or solver outputs, enabling large-scale unsupervised pretraining across diverse problem distributions. 3. The authors pr

Weaknesses

1. **Limited Interpretability**: The discrete “optimization vocabulary” generated through vector quantization lacks semantic clarity, making it difficult to interpret the structural meaning of each code. This limitation affects the model’s scientific transparency and trustworthiness. Providing insights into relational structures between tokens, similar to language model embeddings, would strengthen the paper’s claims. 2. **Restricted Solver Integration**: The embeddings are currently applied on

Reviewer 03 · Rating 8 · Confidence 3

Strengths

The paper is well written and easy to follow. The motivation and problem space are clearly articulated, and constitute an important area of research. The results are strong.

Weaknesses

I would appreciate details of the loss function in the paper itself. At the very least, the notation that is used in the main text should be introduced outside of the appendix. Typo, line 338: “results” -> “result”. The discussion of limitations reads mainly like a list of strengths and future work. Do the authors anticipate any drawbacks to this method? Are there problems that are typically cast as MIP instances that are likely to be difficult using FORGE?

Taxonomy

Topics: Advanced Graph Neural Networks · Graph Theory and Algorithms · Constraint Satisfaction and Optimization

Full text

Forge: Foundational Optimization Representations from Graph Embeddings

Zohair Shafi,1,2 and Serdar Kadıoğlu2, 3

1Khoury College of Computer Science, Northeastern University

2AI Center of Excellence, Fidelity Investments

3Department of Computer Science, Brown University

Abstract

Combinatorial optimization problems are ubiquitous in science and engineering. Still, learning-based approaches to accelerate combinatorial optimization often require solving a large number of difficult instances to collect training data, incurring significant computational cost. Existing learning-based methods require training dedicated models for each problem distribution, for each downstream task, severely limiting their scalability and generalization. We introduce Forge: Foundational Optimization Representations from Graph Embeddings, a framework that pre-trains a vector-quantized graph autoencoder on a large, diverse collection of mixed-integer programming (MIP) instances in an unsupervised manner, without relying on optimization solvers or optimal solutions. Vector quantization produces discrete code assignments that serve as a vocabulary for representing optimization instances. We evaluate Forge in both unsupervised and supervised settings. In the unsupervised setting, Forge embeddings effectively cluster unseen instances across problem domains and sizes. In the supervised setting, we fine-tune Forge embeddings and show that a single pre-trained model helps predict both the integrality gap for cut-generation and variable hints for search guidance across multiple problem and size distributions. In both tasks, we improve the performance of a commercial optimization solver and outperform state-of-the-art learning-based methods. Finally, we open-source our training code, pre-trained Forge weights, and embeddings for multiple MIP distributions to foster further research in representation learning for optimization problems.

1 Introduction

Combinatorial Optimization (CO) problems are fundamental in science and engineering with applications spanning diverse domains, including logistics, energy systems, network design, and recommendations (Gabriel Crainic et al., 2021; Miller et al., 2023; Kadıoğlu et al., 2024). Traditionally, CO problems have been solved using carefully designed meta-heuristics and sophisticated optimization solvers. While effective, such classical methods demand significant domain expertise and computational resources, especially as problem size and complexity grow.

Recent advances in Machine Learning (ML) have introduced promising alternatives for solving CO problems. The approaches for ML-guided optimization fall mainly under two categories: 1) end-to-end models that predict solutions or objective values without relying on solvers or meta-heuristics, and 2) hybrid methods that either 2a) replace computationally intensive solver components with learned models, e.g., learning to predict strong branching heuristics (Gasse et al., 2019), or 2b) guide meta-heuristics with learning from feedback (Cai et al., 2025b). For a comprehensive overview of ML-guided CO, we refer readers to the survey by Bengio et al. (2021b).

Despite their potential, learning-based methods face practical limitations. A significant drawback of learning approaches is their heavy dependency on offline training. Training is computationally costly, depends on carefully curating training datasets with desired properties and distributions of the underlying CO instances, and has limited generalization. Ironically, creating labeled datasets depends on the very optimization solvers that these methods try to accelerate. This defeats the purpose of improving solving for challenging instances that optimization solvers cannot handle today. Adapting learning-based methods to new distributions and domains remains a challenge, and hence, designing a foundational optimization representation in an unsupervised fashion applicable to multiple optimization scenarios is a much-needed alternative. This is exactly what we study in this paper.

Our main motivation is grounded in the exceedingly successful foundational methods in other modalities, such as text and image embeddings. This raises a natural question for optimization: can we leverage the abundance of publicly available mixed-integer programming (MIP) instances to develop a pre-trained, general-purpose foundational model for MIP representations that serves multiple optimization tasks, across varying problem domains and sizes?

The growing success of ML-based approaches for optimization problems makes this direction timely. For example, Zhou et al. (2023) propose a meta-learning framework that generalizes across variants of vehicle routing problems of different sizes, but remains limited to routing. Similarly, Cai et al. (2025a) introduce a multi-task framework for backdoor prediction and solver configuration; however, the shared model is trained separately for each problem. Likewise, Li et al. (2025) propose an LLM-based evolutionary framework to generate diverse MIP problems, but it is supervised and requires a large number of pre-solved instances. While promising, the existing work either generalizes across multiple tasks but remains problem-specific, or scales across different sizes and variants but is task-specific, or remains dependent on solvers. We envision a foundational model that produces MIP embeddings generalizable across tasks, problems, and sizes, trained in an unsupervised fashion, without relying on solving hard optimization problems.

In Natural Language Processing (NLP) and Computer Vision (CV), foundational models have emerged through unsupervised or self-supervised training, enabled by the abundance of data. CO problems, while also benefiting from publicly available datasets, span highly heterogeneous problem types (e.g., Set Covering vs. Combinatorial Auction) and exhibit significant variability within each problem. Most ML-based approaches to CO rely on Graph Neural Networks (GNNs), as proposed by Gasse et al. (2019), which effectively capture local variable and constraint-level information but struggle to encode meaningful global structure. This limitation, rooted in the inherent locality bias of GNNs, is analyzed in detail by Feng et al. (2025). As a result, while embeddings for other modalities such as text, image, and audio are now widespread, no general-purpose instance-level embeddings exist for MIPs to date.

With this vision in mind, we propose Forge: Foundational Optimization Representations from Graph Embeddings. Forge is a foundational model designed to generate MIP embeddings through a pre-training framework that learns structural representations at the instance level in an unsupervised manner, using a broad distribution of MIP instances without requiring access to their solutions.

To achieve this, we incorporate two key ideas: one inspired by NLP and the other by CV. From NLP, we adopt the concept of a vocabulary to represent the latent space of optimization problems, enabling instance-level representations. From CV, we leverage vector quantization to preserve global information, addressing the limitations of GNN-based approaches in prior CO work. By extending these two crucial insights into the CO context, we make the following concrete contributions:

➪ A Foundational Model for Optimization: We propose Forge, a general-purpose foundational model for generating MIP embeddings (§3). Forge captures both local and global structures critical to optimization. Unlike prior work, a single pre-trained Forge model provides embeddings at multiple levels: instance-level representations (one vector per MIP instance), and fine-grained variable and constraint embeddings.

➪ Unsupervised Generalization: In the unsupervised setting, Forge embeddings cluster previously unseen instances across diverse problem types with high accuracy (§4).

➪ Supervised Adaptability: In the supervised setting, pre-trained Forge embeddings can be fine-tuned on diverse downstream tasks using minimal additional data and a low-cost labeling strategy that avoids solving to optimality. We evaluate Forge on two distinct tasks: estimating integrality gap for cut generation (§5.1) and predicting variables for search guidance (§5.2). Notably, a single pre-trained Forge model is fine-tuned and applied across varied domains and problem sizes.

➪ Solver Integration: To enhance traditional optimization solvers, we integrate Forge predictions into Gurobi (Gurobi Optimization, LLC, 2024), a state-of-the-art commercial solver, and demonstrate consistently lower primal gap across both tasks and a wide range of problem domains and sizes.

➪ ML Augmentation: To enhance ML-guided optimization, we evaluate Forge against Li et al. (2025) for integrality gap prediction, and PS-Gurobi (Han et al., 2023) for search guidance, improving their performance on large sets of instances they were trained on, yet unseen by Forge.

2 Background: Mixed-Integer Programming

Let us start with a brief background on Mixed-Integer Programming (MIP), which formulates combinatorial optimization problems of the form $f(x)=\min\{c^{T}x\mid Ax\leq b,x\in\mathbb{R}^{n},x_{j}\in\mathbb{Z}\ \forall j\in I\}$, where $f(x)$ is the objective function, $A\in\mathbb{R}^{m\times n},b\in\mathbb{R}^{m},c\in\mathbb{R}^{n}$, and the non-empty set $I\subseteq\{1,...,n\}$ indexes the integer variables. The Linear Programming (LP) relaxation of a MIP, $x_{lp}$, is obtained by relaxing integer variables to continuous variables, i.e., by replacing the integer constraint $x_{j}\in\mathbb{Z}\ \forall j\in I$ with $x_{j}\in\mathbb{R}\ \forall j\in I$. The LP relaxation is an essential part of the branch-and-bound algorithm for providing bounds. The integrality gap measures how far the optimal objective value of the LP relaxation is from the optimal objective value of the original MIP.
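As a small worked illustration (not taken from the paper), consider Minimum Vertex Cover on a triangle. The symbols $z_{lp}$ and $z_{mip}$ are introduced here only for this example and denote the optimal objective values of the LP relaxation and of the integer program.

```latex
\begin{align*}
\text{MIP:}\quad & \min\; x_1 + x_2 + x_3 \\
& x_1 + x_2 \ge 1,\quad x_2 + x_3 \ge 1,\quad x_1 + x_3 \ge 1,\quad x_j \in \{0,1\} \\
\text{LP relaxation:}\quad & x_j \in [0,1] \ \text{instead of}\ x_j \in \{0,1\} \\
\text{Summing the constraints:}\quad & 2(x_1 + x_2 + x_3) \ge 3
  \;\Rightarrow\; z_{lp} = \tfrac{3}{2} \ \text{at}\ x = (\tfrac12, \tfrac12, \tfrac12) \\
\text{Integer optimum:}\quad & z_{mip} = 2 \ \text{(any two vertices cover all edges)} \\
\text{Ratio:}\quad & z_{lp} / z_{mip} = 0.75, \qquad z_{mip} / z_{lp} \approx 1.33
\end{align*}
```

The LP bound of 1.5 is strictly below the integer optimum of 2, and this difference is what the integrality gap quantifies. Whether the gap is reported as $z_{lp}/z_{mip}$, its inverse, or a percentage is a matter of convention.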

3 Forge: Unsupervised Representation Learning for MIPs

MIP instances are typically represented as a bipartite graph between variables and constraints augmented with node features (Gasse et al., 2019; Ferber et al., 2022; Yau et al., 2024; Chen et al., 2023). This is then followed by training a GNN in a supervised fashion for a specific downstream task on a certain problem class. For example, in Han et al. (2023), a GNN trained on Set Cover (SC) and Independent Set (IS) problems is used to predict variables for warm-starts. Similarly, in Li et al. (2025), a GNN is used for predicting the integrality gap. Numerous variants follow this template (see related works in §6 and Appendix A.9). Notice that all of these methods require supervision and do not yield general-purpose MIP embeddings at the instance level. Taking this a step further, our goal is to learn the structure of MIP instances in an unsupervised manner.

Figure 1 presents our overall architecture, which is composed of these main building blocks:

A) MIP-to-BP: Given a MIP instance, we start with its bipartite (BP) representation and node features. Each node in this bipartite graph represents a constraint or a variable, with edges indicating which variables are part of which constraints. Each node is associated with node features and each edge is weighted by the coefficient of the variable in the constraint. In prior work, node features are typically extracted from the internal branch-and-bound search tree when solving an instance. For example, Gasse et al. (2019) use 18 input node features. We do not attempt to solve or depend on the solution of the instance. Instead, Forge only uses the basic properties of the input instance. For each constraint node, we introduce 4 features composed of its sense (i.e., $>$ , $<$ or $=$ ) and the RHS value. For each variable node, we introduce 6 features composed of its type (integer, binary, continuous), upper/lower bound, and the coefficient in the objective function. In total, for each node, we obtain a vector of size 10, padded with zeros accordingly based on node type (Figure 1-A).
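The encoding described in step A can be sketched as follows. This is a minimal illustration under stated assumptions: the one-hot layout of the sense and type features, the ordering of the 10 slots, and the helper names (`constraint_features`, `variable_features`, `mip_to_bipartite`) are hypothetical and not taken from the released Forge code.

```python
SENSES = ("<", ">", "=")                          # assumed one-hot order for the constraint sense
VAR_TYPES = ("integer", "binary", "continuous")   # assumed one-hot order for the variable type


def one_hot(value, choices):
    return [1.0 if value == c else 0.0 for c in choices]


def constraint_features(sense, rhs):
    # 4 constraint features (3 sense flags + RHS), zero-padded in the 6 variable slots.
    return one_hot(sense, SENSES) + [float(rhs)] + [0.0] * 6


def variable_features(var_type, lower, upper, obj_coef):
    # Zero-padded in the 4 constraint slots, then 6 variable features.
    return [0.0] * 4 + one_hot(var_type, VAR_TYPES) + [float(lower), float(upper), float(obj_coef)]


def mip_to_bipartite(constraints, variables):
    """constraints: list of (sense, rhs, {var_index: coefficient});
    variables: list of (type, lower, upper, objective_coefficient)."""
    node_features = [constraint_features(s, rhs) for s, rhs, _ in constraints]
    offset = len(constraints)  # variable nodes are numbered after constraint nodes
    node_features += [variable_features(*v) for v in variables]
    # One weighted edge per nonzero coefficient: (constraint node, variable node, coefficient).
    edges = [(ci, offset + vj, coef)
             for ci, (_, _, row) in enumerate(constraints)
             for vj, coef in row.items()]
    return node_features, edges


# Toy vertex cover on a path with three vertices:
# min x0 + x1 + x2  s.t.  x0 + x1 >= 1,  x1 + x2 >= 1.
constraints = [(">", 1, {0: 1.0, 1: 1.0}), (">", 1, {1: 1.0, 2: 1.0})]
variables = [("binary", 0, 1, 1.0)] * 3
features, edges = mip_to_bipartite(constraints, variables)
```

Every node ends up with a 10-dimensional vector, and nothing in the construction requires calling a solver.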

B) BP-to-GNN: This bipartite graph with 10-dimensional input node features is passed into a GNN, akin to previous works, to generate embeddings for each constraint and variable node. More specifically, Forge uses two GraphSage (Hamilton et al., 2017) layers that project each input node into a $d$-dimensional embedding space. As discussed earlier, while GNNs are good at capturing local variable- and constraint-level information, they struggle to capture meaningful global information at the instance level due to their inherent locality bias (Feng et al., 2025). Preserving global structure is important in CO problems, especially to generalize across problem types (Figure 1-B).
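To make the locality of message passing concrete, here is a minimal sketch of one GraphSAGE-style layer with a mean aggregator, written in plain Python. It is an assumption-laden illustration: it ignores edge weights, uses a ReLU nonlinearity, and the names `sage_mean_layer`, `W_self`, and `W_neigh` are hypothetical. The paper only states that two GraphSage layers are used.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sage_mean_layer(features, neighbors, W_self, W_neigh):
    """One layer: h_v' = ReLU(W_self h_v + W_neigh mean_{u in N(v)} h_u)."""
    out = []
    for v, h in enumerate(features):
        nbrs = neighbors[v]
        if nbrs:
            mean = [sum(features[u][k] for u in nbrs) / len(nbrs) for k in range(len(h))]
        else:
            mean = [0.0] * len(h)  # isolated node: no neighbor message
        z = [a + b for a, b in zip(matvec(W_self, h), matvec(W_neigh, mean))]
        out.append([max(0.0, t) for t in z])
    return out


# Tiny graph: node 2 is connected to nodes 0 and 1.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [2], 1: [2], 2: [0, 1]}
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = sage_mean_layer(features, neighbors, identity, identity)
```

With two such layers, each node only sees information from nodes at most two hops away, which is the locality bias the text refers to.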

C) Vector Quantized Codebook: To preserve global structure, we introduce a vector quantized codebook with $k$ discrete codes. These codes act as a ‘vocabulary’, akin to language models, across MIP instances of various domains and difficulties, thereby preserving global structure. The design follows the approaches developed in computer vision (Van Den Oord et al., 2017; Yu et al., 2022; Lee et al., 2022) and the structure-aware graph tokenizer extension in Yang et al. (2024) (Figure 1-C).

D) GNN-to-CW: GNN embeddings are passed into a vector quantizer which consists of a codebook with $k$ codes. The codebook maps each variable and constraint node to a discrete code. Each code is then mapped into a $d$-dimensional codeword (CW), producing CW representations for constraints and variables, aligned with the dimensionality of the hidden GNN layers (Figure 1-D).
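The code assignment in step D can be sketched as a nearest-codeword lookup, as is standard in vector-quantized autoencoders. The paper does not spell out the distance used here, so the squared Euclidean distance and the function names below are assumptions.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(node_embeddings, codebook):
    """Return (codes, codewords): the nearest codebook index and vector for each node embedding."""
    codes = [min(range(len(codebook)), key=lambda k: squared_distance(h, codebook[k]))
             for h in node_embeddings]
    codewords = [codebook[c] for c in codes]
    return codes, codewords


# Toy codebook with k = 3 codes in d = 2 dimensions.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
node_embeddings = [[0.9, 1.2], [3.5, 0.1], [0.1, -0.2], [1.1, 0.8]]
codes, codewords = quantize(node_embeddings, codebook)
```

In this toy example the first and last nodes share code 1, which is how structurally similar nodes end up with the same "word" of the vocabulary.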

E) CW-to-BP: We use the CW corresponding to each constraint and variable node to reconstruct the original bipartite representation of the MIP instance. These codewords are passed into a linear node feature decoder and a linear edge decoder to reconstruct the input bipartite graph. By doing so, we obtain an unsupervised method that learns from the structure of MIP instances (Figure 1-E).

F) Loss Function: Our loss function minimizes the edge reconstruction loss, the node feature reconstruction loss, and losses related to the vector quantization. Concretely, the loss function is $\mathcal{L}=\mathcal{L}_{Rec}+\mathcal{L}_{Codebook}+\mathcal{L}_{Commitment}$ , where we are given $N$ nodes, input node features $v_{i}$ for $i\in\{1,\dots,N\}$ , the adjacency matrix $A$ , and a matrix $\hat{X}$ composed of the reconstructed node features $\hat{v_{i}}$ . Specifically, the reconstruction loss is $\mathcal{L}_{Rec}=(A-\hat{X}\hat{X^{T}})^{2}+\frac{1}{N}\sum_{i=1}^{N}(\hat{v_{i}}-v_{i})^{2}$ , the codebook loss is $\mathcal{L}_{Codebook}=\frac{1}{N}\sum_{i=1}^{N}\|sg[h_{i}]-cw_{i}\|_{2}^{2}$ , and the commitment loss is $\mathcal{L}_{Commitment}=\frac{\alpha}{N}\sum_{i=1}^{N}\|sg[cw_{i}]-h_{i}\|_{2}^{2}$ . Here, $h_{i}$ is the GNN embedding of node $i$ , $cw_{i}$ is its assigned codeword, $sg[\cdot]$ is the stop-gradient operator, and $\alpha$ weights the commitment term. For more details on the loss function, we refer to Appendix A.1.
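The three loss terms can be evaluated numerically as in the sketch below. It only computes forward values: the stop-gradient $sg[\cdot]$ changes which parameters receive gradients, not the value of the loss, so it does not appear here. Two choices are assumptions, not details from the paper: the edge term is summed over all entries of $A-\hat{X}\hat{X}^{T}$, and `alpha=0.25` is a placeholder default.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def forge_loss_values(A, V, V_hat, H, CW, alpha=0.25):
    """A: adjacency matrix, V / V_hat: input and reconstructed node features,
    H: GNN embeddings, CW: assigned codewords (all as lists of lists)."""
    n = len(V)
    # Edge term: squared error between A and X_hat X_hat^T, where the rows of X_hat are V_hat.
    edge = sum((A[i][j] - sum(a * b for a, b in zip(V_hat[i], V_hat[j]))) ** 2
               for i in range(n) for j in range(n))
    node = sum(sq_dist(vh, v) for vh, v in zip(V_hat, V)) / n
    codebook = sum(sq_dist(h, c) for h, c in zip(H, CW)) / n
    commitment = alpha * codebook  # same value up to alpha; the gradients differ
    return {"rec": edge + node, "codebook": codebook,
            "commitment": commitment, "total": edge + node + codebook + commitment}


losses = forge_loss_values(
    A=[[0.0, 1.0], [1.0, 0.0]],
    V=[[1.0, 0.0], [0.0, 1.0]],
    V_hat=[[1.0, 0.0], [0.0, 0.0]],
    H=[[0.0, 0.0], [2.0, 0.0]],
    CW=[[0.0, 0.0], [1.0, 0.0]],
)
```

In an actual training loop these terms would be written in an autodiff framework so that the codebook loss updates only the codewords and the commitment loss updates only the encoder.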

Once Forge is trained in this unsupervised manner across a corpus of MIP instances, we obtain:

Local Constraint & Variable Representations: Each node in the bipartite graph is assigned a discrete code which is mapped to a codeword. These become the constraint and variable embeddings.

Global MIP Instance Representation: We leverage the distribution of codes at the instance level. Each instance is represented with an embedding of vocabulary size, $\lvert codebook\rvert$ , where each value indicates the frequency of the corresponding code. See Figure 7 in Appendix A.2 for an example.
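The instance-level embedding described above amounts to a histogram over code assignments. A minimal sketch follows; the function name is hypothetical, and whether the paper uses raw counts or normalized frequencies is not stated here, so both options are exposed.

```python
from collections import Counter


def instance_embedding(codes, codebook_size, normalize=False):
    """Map the per-node code assignments of one MIP instance to a |codebook|-dimensional vector."""
    counts = Counter(codes)
    vec = [float(counts.get(k, 0)) for k in range(codebook_size)]
    if normalize:
        total = sum(vec) or 1.0  # guard against an empty instance
        vec = [v / total for v in vec]
    return vec


# Four nodes assigned to codes 1, 2, 0, 1 with a toy codebook of size 5.
embedding = instance_embedding([1, 2, 0, 1], codebook_size=5)
normalized = instance_embedding([1, 2, 0, 1], codebook_size=5, normalize=True)
```

Because the vector length depends only on the codebook size, instances with very different numbers of variables and constraints map into the same space.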

Next, we start with an initial investigation of the effectiveness of Forge embeddings when clustering unseen instances across various problem domains (§4), and then explore how they can enhance solving MIP instances via search guidance and cut-generation (§5).

4 Initial Analysis of Forge Embeddings

Given the absence of methods in the literature that can provide general-purpose MIP embeddings at the instance level, with the exception of earlier works that depend on creating hand-crafted features to classify MIP instances (e.g., Kadioglu et al., 2010), we start our initial analysis with a comparison of Forge embeddings against two (ablation) baselines for clustering unseen instances.

We evaluate the clustering both quantitatively (accuracy) and qualitatively (visual inspection). To do so, we train Forge on a set of MIP instances from MIPLIB (Gleixner et al., 2021), and test it on unseen instances from Distributional MIPLIB (D-MIPLIB) (Huang et al., 2024) and also on strIPlib (Bastubbe et al., 2025) in Appendix A.3. While MIPLIB is a mixed dataset, D-MIPLIB and strIPlib are categorized into different problem classes and difficulties. These categories serve as ground truth labels when we treat each problem class as a cluster to evaluate our embeddings.

**Training:** We use 600 instances from MIPLIB, sorted by size to ensure the resulting bipartite graphs fit in GPU memory. For additional training data, we generate two instances from each MIPLIB instance by randomly dropping 5% and 10% of constraints (note that dropping constraints only relaxes the problem). In total, we obtain 1,800 MIP instances to train Forge. We use two GraphSage layers with $d=1024$ dimensions and a codebook with $k=5000$ codes, the size of the vocabulary. We conduct an ablation study on the codebook size in Appendix A.4.
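The augmentation step can be sketched as below, with each instance reduced to a list of constraints for illustration. The function names and the fixed seed are assumptions; the paper only states that 5% and 10% of constraints are dropped at random.

```python
import random


def drop_constraints(constraints, fraction, seed=0):
    """Return a relaxed copy of the instance with a random `fraction` of constraints removed."""
    rng = random.Random(seed)
    n_drop = int(round(fraction * len(constraints)))
    dropped = set(rng.sample(range(len(constraints)), n_drop))
    return [c for i, c in enumerate(constraints) if i not in dropped]


def augment(instances, fractions=(0.05, 0.10)):
    corpus = []
    for instance in instances:
        corpus.append(instance)  # keep the original instance
        for fraction in fractions:
            corpus.append(drop_constraints(instance, fraction))
    return corpus


# 600 toy instances with 20 placeholder constraints each.
toy_instances = [[f"c{i}" for i in range(20)] for _ in range(600)]
corpus = augment(toy_instances)
```

Each original instance contributes itself plus two relaxed variants, which gives the 600 × 3 = 1,800 training instances mentioned in the text.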

**Testing:** We evaluate Forge on clustering 1,050 instances from D-MIPLIB categorized into 21 domain-difficulty pairs. These include Set Cover (SC easy, medium, hard), Max. Independent Set (MIS easy, medium), Min. Vertex Cover (MVC easy, medium, hard), Generalized Independent Set (GIS easy, medium, hard, very-hard, very-hard2, ext-hard), Combinatorial Auction (CA very-easy, easy, medium, very-hard, very-hard2), Item Placement (IP very-hard) and Maritime Inventory Routing (MIRP medium), covering a broad spectrum of problem domains, sizes, and difficulty levels.

**MIP Embeddings:** These 1,050 unseen instances are passed into the pre-trained Forge model to generate one embedding per MIP instance. For comparison, we consider two alternatives. As a baseline, we use the Mean Readout of the GNN embeddings within the trained Forge model. This generates the embedding of an instance by averaging all node embeddings from the GNN (Fig. 1-B). This behaves as an ablation for Forge without vector quantization. Given the weakness of GNN embeddings in capturing global structure, we expect this to perform poorly. Alternatively, starting from the 10-dimensional static node features of the bipartite graph (Fig. 1-A), we run two-hop label propagation and average the resulting node vectors (Zhu & Ghahramani, 2002). Note that hand-crafted, static instance descriptors (e.g., Kadioglu et al., 2010) cannot be used, since every instance within the same category has identical statistics (number of variables, constraints, etc.), which makes such descriptors non-informative.

**Clustering Visualization:** Figure 2 visualizes these embedding vectors, projected into two dimensions using PaCMAP (Wang et al., 2021). Each dot represents an instance colored by its category. As expected, the Mean Readout method loses the global structure, leading to arbitrary clusters. Label Propagation captures the structure much better, although some problems and sizes are mixed. Forge embeddings perform best, with each problem domain cleanly separated. Moreover, within each domain, the difference in problem difficulty is also visible from the easy to the hard categories. Recall that Forge is trained only on MIPLIB, not on these instances.

**Clustering Accuracy:** For quantitative evaluation, we run k-means with 21 clusters, expecting one cluster for each category. We calculate the normalized mutual information (NMI) score between the ground truth categories and clusters, averaged over 10 runs of k-means, for each method. If the predicted clusters are identical to the original categories, the NMI score is 1.0; if the clusters carry no information about the categories, it is 0.0. As qualitatively observed from the 2D visualization, the Mean Readout performs poorly with an NMI score of 0.087. This is only slightly better than random cluster predictions with an NMI score of $1/21\approx 0.048$ . This is likely due to over-smoothing from averaging dense GNN embedding vectors across a large number of constraints and variables. In contrast, Label Propagation performs better with an NMI score of 0.790. This method operates directly on the sparse input features of the bipartite graph, hence avoiding the over-smoothing problem encountered in GNNs. The best performance is achieved by Forge with an NMI score of 0.843. Interestingly, the Mean Readout embeddings, which perform very poorly, are based on the same pre-trained GNN backbone as Forge. By utilizing the distribution of the discrete codes of constraints and variables, Forge circumvents the over-smoothing issue and captures the global structure of unseen MIP instances. As a proof of concept, we apply vector arithmetic on these MIP embeddings, drawing inspiration from language models and the well-known King - Man + Woman $\approx$ Queen example (Ethayarajh et al., 2018), to evaluate Vertex Cover - Cover + Packing $\approx$ Independent Set (see Appendix A.5).
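For readers unfamiliar with the metric, the following is a minimal pure-Python sketch of NMI between two labelings. It assumes the arithmetic-mean normalization $2I(T;C)/(H(T)+H(C))$; the text does not say which normalization was used, so treat this choice as an assumption.

```python
from collections import Counter
from math import log


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum((nij / n) * log(nij * n / (count_a[i] * count_b[j]))
               for (i, j), nij in joint.items())


def nmi(truth, predicted):
    h_truth, h_pred = entropy(truth), entropy(predicted)
    if h_truth == 0.0 and h_pred == 0.0:
        return 1.0  # both labelings are a single cluster
    # Arithmetic-mean normalization (assumed).
    return mutual_information(truth, predicted) / ((h_truth + h_pred) / 2)


perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])        # same partition, different cluster ids
uninformative = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # clusters independent of the categories
```

The score depends only on the partition, not on the cluster ids, so a relabeled perfect clustering still scores 1, while a clustering that is statistically independent of the categories scores 0.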

5 Experiments

So far, we have only demonstrated that Forge embeddings reliably cluster unseen instances from diverse problem classes. Our ultimate goal is to improve MIP solving, hence we now shift to supervised evaluations. We design the next set of experiments on downstream tasks that (1) provide utility to enhance solving MIPs, (2) are commonly applied in the literature to enable a fair comparison, (3) are radically different from each other, whereby we use the same Forge model to validate its general applicability across tasks, problems, and sizes, and (4) are agnostic to the underlying MIP solver, i.e., we do not depend on internal access to specific solver procedures within the branch-and-bound tree.

We therefore study two fundamentally different downstream tasks: predicting the integrality gap of a given MIP instance, as also studied by Li et al. (2025), and predicting variables for search guidance, as also studied by Han et al. (2023). Despite being unrelated, both tasks serve as primal heuristics to speed up MIP solving by obtaining better solutions faster. The integrality gap is used to generate a pseudo-cut added to the original problem formulation to tighten its bound at the beginning of the search, whereas variable guidance is used to provide hints to the solver during search.

The critical aspect of these experiments is that we use the same pre-trained Forge model to obtain general-purpose MIP embeddings that are then fine-tuned on a small amount of cheaply labeled data to learn prediction heads for completely different tasks, as shown in Figure 3-A. An analogy from NLP is to generate pre-trained word embeddings using a foundational model, and then to fine-tune prediction heads for entity extraction in a specific domain, e.g., finance, using a small set of labels. Our goal is to replicate the success of foundational models from NLP and CV in the context of CO.

Training the Foundational Forge Model: The setup for training Forge is identical to the setting in the previous clustering analysis, except that we now train on both the 1,800 MIPLIB instances and the 1,050 D-MIPLIB instances. In total, Forge is trained on a corpus of 2,850 MIP instances, yielding a model with 3.25 million parameters. We again use two GraphSage layers with $d=1024$ and $k=5000$ . Details of our experimental setup, parameters, and machines are in Appendix A.6.

5.1 Task - I: Predicting the Integrality Gap for Pseudo-Cut Generation

Integrality Gap Prediction: As briefly mentioned in §2, the integrality gap measures the ratio between the optimal value of the LP relaxation and the optimal value of the original integer program. Intuitively, this gap quantifies the quality of the approximation offered by the LP relaxation. A smaller gap means the LP relaxation is a good approximation and is close to the value of the best integral solution. Here, we are interested in predicting the integrality gap of a given MIP instance, without solving for its optimal integral solution value, as also studied by Li et al. (2025).

Pseudo-Cut Generation: If we can predict the integrality gap, then we can generate a pseudo-cut and add it as an additional constraint to the model to immediately bound the optimal objective value from the root node LP relaxation. Note that this cut is not guaranteed to be valid at all times, hence it is a pseudo-cut. If the integrality gap prediction is incorrect, it risks over- (or under-) estimating the best objective value. This makes integrality gap prediction a challenging problem.

Training Instances: We do not expect this task to generalize between problem classes and/or sets of varying complexities of a given problem class. For instance, there is no reason for the LP gap (say, 70%) to be the same for easy and hard SC instances. Figure 3-B shows the distribution of integrality gap across different categories, and as expected, the integrality gap can be anywhere from 5% to 95% with wide distributions in between. As such, there is no magic constant that one could use heuristically at all times, which makes integrality gap prediction a deliberate learning task. For training, we only consider CA (very-easy, easy, medium), SC (easy, medium, hard), and GIS (easy, medium, hard) with 50 instances for each. In total, we obtain 450 training instances. Note that this is a considerably smaller training set than the one used for pre-training Forge embeddings.

Label Collection: To generate training labels, each training instance is solved using Gurobi with a 120s time limit. Note that this does not require solving instances to optimality, which can be costly. The numeric label is defined as the ratio between the integer solution at the time-out and the LP relaxation. As mentioned, overestimating the integrality gap (for minimization problems) can lead to suboptimal solutions, as the solver may terminate prematurely. In contrast, underestimating the gap is generally acceptable, as it still facilitates faster solve times without compromising solution quality. We account for this asymmetry with a conservative labeling strategy: setting a timeout when collecting labels. This often results in underestimating the true integrality gap, especially for hard instances. By doing so, we reduce the risk of the model overestimating the gap, thereby improving the reliability of the predicted cut.

Supervised Fine-Tuning: Given this small labeled data, a dense prediction head is added to the pre-trained Forge model, as shown in Figure 3-A, that takes codewords assigned to each node as input and outputs a real number using mean readout across all nodes. As a regression task, this is trained with the mean absolute error loss in an end-to-end manner.

Test Instances & Setup: We use the fine-tuned Forge to predict the integrality gap of 50 very-hard instances each of CA, SC, GIS, and MVC. Our fine-tuning does not include the ‘very-hard’ category, and MVC is entirely unseen in fine-tuning. Given the prediction, a pseudo-cut is generated by adjusting the initial LP relaxation objective and incorporating it into the original formulation as an additional constraint. This enforces the integral objective to exceed (or fall below) the generated pseudo-cut.
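The following is a minimal sketch of how a predicted gap could be turned into a pseudo-cut. It assumes the label convention from Label Collection (the ratio of the integer objective to the LP relaxation objective) and a positive LP objective. The names `pseudo_cut_bound`, `cut_sense`, and the `slack` parameter are hypothetical and introduced only for illustration; the text does not specify exactly how the LP objective is adjusted.

```python
def pseudo_cut_bound(lp_objective: float, predicted_ratio: float, slack: float = 0.0) -> float:
    """Return the right-hand side of a pseudo-cut on the objective value.

    Assumes predicted_ratio approximates (integer objective) / (LP objective).
    slack in [0, 1) pulls the bound back toward the LP value to be conservative
    (an illustrative knob, not part of the described method).
    """
    if not 0.0 <= slack < 1.0:
        raise ValueError("slack must be in [0, 1)")
    target = lp_objective * predicted_ratio
    # Move only part of the way from the LP value to the predicted target.
    return lp_objective + (1.0 - slack) * (target - lp_objective)


def cut_sense(is_minimization: bool) -> str:
    # Minimization: the LP value is a lower bound, so the cut is objective >= bound.
    # Maximization: the LP value is an upper bound, so the cut is objective <= bound.
    return ">=" if is_minimization else "<="


bound = pseudo_cut_bound(lp_objective=100.0, predicted_ratio=1.2)
print(f"objective {cut_sense(True)} {bound}")
```

Because the bound comes from a prediction, the resulting constraint can cut off the true optimum when the ratio is overestimated, which is why the cut is only a pseudo-cut.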

Prediction Accuracy: We measure the deviation in mean absolute error between the known integrality gap and the gap predicted by Forge. On these very-hard test instances, Forge achieves a deviation of $15.42\%$, $13.55\%$, $12.03\%$, and $19.077\%$ for CA, SC, GIS, and MVC, respectively. As an ablation, training for this task from scratch, without using pre-trained Forge as starting weights, worsens the error by ${\sim}$33% on average across all categories. This highlights the importance of unsupervised pre-training to capture transferable structural patterns across diverse MIP instances.

Comparison with Commercial MIP Solver: We compare the commercial Gurobi solver on these very-hard instances with and without our predicted pseudo-cut. Figure 4 shows the primal gap averaged over 50 instances of each problem with a 3600s time limit. Without exception, across all problem types, the use of pseudo-cuts generated by Forge consistently results in better primal gaps. The solver improves the gap early in the search, and our pseudo-cuts make these gains immediately more pronounced. The performance gains in the primal gap reach up to 85%. Recall that fine-tuning for this task only needed 50 instances per problem type, MVC was not included in the fine-tuning, and fine-tuning had no dependency on optimal solutions to generate the labeled data.

Comparison with SOTA ML: To further evaluate generalization across problem types and sizes, we compare against the setup in Li et al. (2025), where a GNN is trained on 38,256 instances from 643 generated problem types and tested on 11,584 instances from 157 problem types. Forge is used as is, without any additional training, and tested on 17,500 previously unseen instances from 400 generated problem types. Forge achieves a mean deviation of 18.63% in integrality gap prediction, improving over the 20.14% deviation reported in (Li et al., 2025).

5.2 Task - II: Guiding the Search for Optimal Solutions

The previous task evaluates the global Forge representations at the instance level, while our next task evaluates the local Forge representations, specifically the variable embeddings for search guidance. The idea is to fine-tune Forge on a smaller dataset, with a labeling strategy that does not depend on solving to optimality, and to provide variable hints to the Gurobi solver.

Training & Labeling: We collect 100 instances from CA (easy, medium), SC (easy, medium, hard) and GIS (easy, medium) for a total of 700 training instances. Each instance is solved using Gurobi to find a pool of five feasible solutions within five minutes. Optimality is not required for labeling.

Supervised Fine-Tuning: Given five feasible solutions, variables that never appear in any solution are marked as ‘negative’ and variables that appear in a solution at least once are marked as ‘positive’. Forge is fine-tuned using a combination of binary cross-entropy (BCE) and triplet loss. For BCE, we add a dense prediction head to pre-trained Forge as in integrality gap prediction (Figure 3-A). In parallel, we use the standard triplet loss (Schroff et al., 2015), where variables that appear in the same number of solutions are treated as ‘positive’ and ‘anchor’ pairs (Figure 3-C). A key challenge of the triplet loss is to identify good negatives, as trivial negatives do not help learning. Our pre-trained Forge helps circumvent this issue: for every positive/anchor pair of variables, we select the negative variable that is closest to the anchor variable in the unsupervised Forge embedding space. We fine-tune Forge to minimize the sum of the BCE and triplet losses, weighted equally. One loss minimizes the binary labeling error, and the other minimizes the distance between the embeddings of the ‘anchor’ and ‘positive’ variables while ensuring the ‘negative’ variable is at least ‘margin’ distance away from the ‘anchor’ variable. Details on the triplet loss are in Appendix A.7.
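The labeling step above can be sketched as follows. This is an illustration only: it assumes binary variables, interprets "appearing in a solution" as taking value 1, and represents each solution as a set of variable indices. The function name `label_variables` and the toy solution pool are hypothetical.

```python
from collections import defaultdict


def label_variables(solutions, num_vars):
    """Count how often each variable is set to 1 across a pool of feasible solutions.

    solutions: list of sets, each holding the indices of variables equal to 1.
    Returns (counts, binary_labels, groups).
    """
    counts = [sum(1 for sol in solutions if v in sol) for v in range(num_vars)]
    # BCE target: 1 if the variable appears in at least one solution, else 0.
    binary_labels = [1 if c > 0 else 0 for c in counts]
    # Variables with the same positive count form candidate anchor/positive pairs.
    groups = defaultdict(list)
    for v, c in enumerate(counts):
        if c > 0:
            groups[c].append(v)
    return counts, binary_labels, dict(groups)


# Toy pool of five feasible solutions over five variables (made-up data).
pool = [{0, 1}, {0, 2}, {0, 1}, {0, 3}, {0, 1}]
counts, labels, groups = label_variables(pool, num_vars=5)
print(counts, labels, groups)
```

In this toy pool, variables 2 and 3 each appear in exactly one solution, so they would form an anchor/positive pair, while variable 4 never appears and is a candidate negative.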

Test Instances & Setup: We test on 50 medium instances from each of CA, SC, GIS, and MVC. Again, MVC is unseen in fine-tuning. We use our fine-tuned Forge to predict the likelihood of variables to appear in the solution. For search guidance to the solver, we begin with a feasible solution found by Gurobi within 1s. Variables appearing in this solution serve as anchors. The neighbors within a fixed radius of the positive anchors in the embedding space, that are also in the top-decile of the Forge prediction head, are hinted to the solver for inclusion. Conversely, the neighbors of the negative anchors that are in the bottom-decile of the Forge predictions are hinted to the solver for exclusion. This strategy exploits our training objective that optimizes for variable prediction using BCE loss and clustering of positive and negative variables using triplet loss.
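A minimal sketch of the hint-selection rule described above, under stated assumptions: embeddings are plain coordinate tuples, distances are Euclidean, and "negative anchors" are taken to be the variables not selected in the initial solution. The function name `select_hints` and the toy data are hypothetical. How hints reach the solver is not stated in the text; one possible mechanism is a per-variable hint attribute such as Gurobi's `VarHintVal`.

```python
import math


def select_hints(embeddings, scores, anchors_pos, anchors_neg, radius=0.1):
    """Pick variables to hint for inclusion or exclusion.

    embeddings: list of coordinate tuples, one per variable.
    scores: predicted likelihood per variable from the prediction head.
    anchors_pos / anchors_neg: variable indices set to 1 / 0 in the initial solution.
    """
    ranked = sorted(range(len(scores)), key=lambda v: scores[v])
    decile = max(1, len(scores) // 10)
    bottom, top = set(ranked[:decile]), set(ranked[-decile:])

    def near(v, anchors):
        # True if v lies within the fixed radius of at least one anchor.
        return any(math.dist(embeddings[v], embeddings[a]) <= radius for a in anchors)

    include = sorted(v for v in top if near(v, anchors_pos))
    exclude = sorted(v for v in bottom if near(v, anchors_neg))
    return include, exclude


# Toy example with ten variables in a 2D embedding space (made-up data).
emb = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0), (1.05, 1.0), (2.0, 2.0),
       (3.0, 3.0), (4.0, 4.0), (5.0, 5.0), (6.0, 6.0), (7.0, 7.0)]
scores = [0.8, 0.95, 0.2, 0.01, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
include, exclude = select_hints(emb, scores, anchors_pos=[0], anchors_neg=[2])
print(include, exclude)
```

Requiring both conditions (embedding proximity and an extreme prediction score) keeps the hint set small, which limits the damage a single wrong hint can do.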

Comparison with the Commercial MIP Solver: We compare the performance of Gurobi solver on these instances with and without our search guidance. Figure 5 shows primal gaps averaged over 50 instances for each problem under a 3600s time limit. As in the previous experiments, the commercial solver, when powered by the search guidance from Forge, achieves consistently better primal gaps (up to 48% improvements) and converges to optimal solutions significantly faster (up to 35% speed-ups). In short, Forge makes Gurobi faster and better on all tested problems.

Comparison with SOTA ML: As a final experiment, we test the ability of Forge to augment not only a MIP solver but also other ML methods. For this, we augment the SOTA ML method, PS-Gurobi (Han et al., 2023), which also studies search guidance for Gurobi and already demonstrates strong performance against previous learning-based approaches such as (Nair et al., 2020). We use our pre-trained Forge embeddings as-is and concatenate variable and node embeddings of PS-Gurobi with our unsupervised embeddings (after PCA to reduce to 64 dimensions to fit into the architecture of PS-Gurobi).

We evaluate on the common subset of problems, CA and GIS, used in PS-Gurobi experiments. As shown in Figure 6, augmenting PS-Gurobi with our Forge embeddings yields significant improvements: over 40% reduction in primal gap on average for CA instances, and over 50% reduction on average for GIS instances. Additional randomized control ablations are provided in Appendix A.8.

6 Related Work

We cover immediately relevant work here and elaborate further in Appendix A.9. In terms of unsupervised learning, our model is closest to unsupervised approaches studied in (Sanokowski et al., 2024; Karalias & Loukas, 2020; Bu et al., 2024). However, these works aim to reformulate the discrete, combinatorial objective of specific problems into a differentiable one to learn a solution using gradient descent in an end-to-end manner. Forge is different as it learns the structure of the instance in an unsupervised manner. This gives us the ability to represent any MIP instance off-the-shelf with a single pre-trained model. Other successful ML-based optimization methods show that generalization is possible, e.g., Cai et al. (2025a) introduce multi-task learning using a shared model for predicting both backdoors and solver configuration, albeit trained per problem type. Problem-specific works show generalization across variants of a single problem, e.g., Zhou et al. (2023) for variants of vehicle routing and Shafi et al. (2025) for variants of set covering problems. The existing work either generalizes across multiple tasks but remains specific to one problem, or scales to different sizes and problem variants but is specific to one task, or remains supervised. Forge is the first to generate MIP embeddings that generalize across multiple tasks, different problems and sizes, and are trained in an unsupervised fashion, without dependency on solutions. Regarding downstream optimization tasks, (Li et al., 2025) addresses integrality gap prediction and (Han et al., 2023) focuses on search guidance, both of which we evaluate and compare against. In both cases, with minimal additional training and a cheap labeling strategy, Forge consistently matches and improves their performance.

There exist other important downstream tasks, such as predicting solver parameters (Hosny & Reda, 2024), learning to branch (Khalil et al., 2016; Liberto et al., 2016), node selection (He et al., 2014a), and cut selection (Paulus et al., 2022). Beyond MIPs, meta-heuristics (Cai et al., 2025b), constraint satisfaction (Tönshoff et al., 2023), and SAT (Duan et al., 2022) show benefits from learning-based approaches. While most of these works, including ours, are based on GNN representations (Gasse et al., 2019), Drakulic et al. (2024) avoid GNNs and instead use mixed attention.

7 Limitations & Future Work

We present Forge, a novel unsupervised framework for learning structural representations of optimization problems at the instance, variable, and constraint levels, without requiring access to solvers or ground-truth solutions. Inspired by NLP and CV, Forge introduces a discrete vocabulary of optimization codes employing vector quantization to capture global structure.

A single pre-trained Forge model clusters unseen instances from diverse benchmarks and generalizes across two distinct optimization tasks on diverse problem domains with varying difficulty levels. These embeddings integrate seamlessly into both a commercial solver and state-of-the-art ML pipelines, consistently yielding measurable performance improvements. Our study is subject to several limitations and opens the door for promising directions for future research:

• Scale: Forge is compact in size (3.25M parameters trained on ${\sim}$2.8K instances) and is much smaller than large-scale models in other domains. In principle, it is feasible to train our framework on all publicly available and synthetically generated MIP instances.

• Interpretability: The semantics of the learned optimization vocabulary remain largely unexplored. Our preliminary evidence suggests that certain codes capture local structure, such as cliques of variables and constraints, enabling generalization to larger problems.

• Solver Integration: Current experiments are one-shot, using embeddings to generate a pseudo-cut or guide the solver once. Extending this to operate throughout the branch-and-bound tree could enable tighter integration with the solving process for further improvements.

• Downstream Tasks: Many other important optimization tasks remain unexplored. Forge embeddings can be leveraged for warm-starts, variable selection, node selection, cut selection, solver configuration, and portfolio construction, among others.

• Generalization: The underlying principles of Forge may extend beyond optimization to constraint satisfaction problems, as well as from complete branch-and-bound search to incomplete search methods and meta-heuristics.

To enable these future directions and support reproducible research, we open-source our datasets, training pipelines, pre-trained and fine-tuned Forge models, readily available MIP embeddings across problem distributions from MIPLIB, D-MIPLIB, and strIPlib, and release Forge-Os optimization-as-a-service for on-demand retrieval and generation of optimization embeddings.

Appendix A Appendix

A.1 Details of the Reconstruction Loss

As mentioned in §3, our loss function minimizes the edge reconstruction loss, the node feature reconstruction loss, and losses related to the vector quantization. Concretely, the loss function is:

[TABLE]

where, given $N$ nodes, input node features $v_{i}\ \forall i\in N$, the adjacency matrix $A$, and a matrix $\hat{X}$ composed of reconstructed input features $v_{i}$, the reconstruction loss $\mathcal{L}_{Rec}$, the codebook loss $\mathcal{L}_{Codebook}$, and the commitment loss $\mathcal{L}_{Commitment}$ are given by:

[TABLE]

Here, $sg[.]$ is the stop-gradient operator, $h_{i}$ is the hidden layer representation of node $i$ after the GNN forward pass and $cw_{i}$ is the codeword corresponding to the code that node $i$ has been assigned.

Intuitively, the codebook loss in Eq. 3 can be interpreted as k-means clustering, where the codewords $cw_{i}$ (akin to cluster centroids) are moved closer to the node embeddings $h_{i}$ and the node embeddings $h_{i}$ are fixed in place due to the stop-gradient operator. Conversely, the commitment loss in Eq. 4 fixes the codewords $cw_{i}$ using the stop-gradient operator, and instead, moves the embeddings $h_{i}$ towards the codewords. The hyperparameter $\alpha$ weighs the importance of the commitment loss.
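For reference, the two quantization terms described above take the following form in the standard VQ-VAE formulation. This is a sketch under that assumption: the normalization by $N$, the squared Euclidean norm, and the name $\mathcal{L}_{VQ}$ for the combined term are assumptions made here for illustration and may differ from the exact expressions used in Forge.

```latex
\begin{align*}
\mathcal{L}_{Codebook} &= \frac{1}{N}\sum_{i=1}^{N}\left\lVert sg[h_{i}] - cw_{i}\right\rVert_{2}^{2}, \\
\mathcal{L}_{Commitment} &= \frac{1}{N}\sum_{i=1}^{N}\left\lVert sg[cw_{i}] - h_{i}\right\rVert_{2}^{2}, \\
\mathcal{L}_{VQ} &= \mathcal{L}_{Codebook} + \alpha\,\mathcal{L}_{Commitment}.
\end{align*}
```

The two terms have the same value but different gradients: the stop-gradient in the first lets only the codewords $cw_{i}$ move, while the stop-gradient in the second lets only the embeddings $h_{i}$ move.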

A.2 Example of Instance Level MIP Embeddings

As mentioned in §3, instance-level MIP embeddings are generated by computing the distribution of codes that each variable and constraint in a MIP instance has been assigned. This process is shown in Figure 7 with a codebook of size 5 (for brevity) and the resulting MIP embedding $\vec{emb}=[3,2,3,2,0]$.
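The code-counting step can be sketched in a few lines. The node-to-code assignment below is hypothetical, chosen only so that it reproduces the example vector above; the sketch also assumes the embedding holds raw counts rather than normalized frequencies, which is consistent with that vector.

```python
from collections import Counter


def instance_embedding(code_assignments, codebook_size):
    """Histogram of discrete codes over all variable and constraint nodes."""
    counts = Counter(code_assignments)
    return [counts.get(code, 0) for code in range(codebook_size)]


# Hypothetical assignment of 10 nodes to a codebook of size 5.
codes = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
emb = instance_embedding(codes, codebook_size=5)
print(emb)  # [3, 2, 3, 2, 0]
```

Because the embedding length equals the codebook size and not the number of nodes, instances of different sizes map to vectors of the same dimension.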

A.3 Additional Clustering Results on strIPlib

To further validate our clustering results from §4, where we pre-trained Forge on 1,800 MIPLIB instances and clustered instances from D-MIPLIB, we repeat the experiment using our MIPLIB-pre-trained Forge to cluster instances from strIPlib (Bastubbe et al., 2025). We select 50 instances from each of 10 previously unseen problem types: Bin Packing, Capacitated Warehouse Location, Cutpacking, General Assignment, Job Shop Scheduling, Map Labeling, Scheduling, Stochastic Server Location, Train Timetabling, and Vertex Coloring.

Figure 8 shows 2D clustering visualizations using PaCMAP. As in our previous study, these strIPlib instances were unseen by Forge, yet the model still clusters different problems cleanly. For example, all packing instances group together in the top-left, while warehouse and server location instances cluster in the bottom-left. Other interesting patterns emerge, such as Train Timetabling and Map Labeling appearing close to each other, suggesting potential transfer learning opportunity.

A.4 Ablation Study on The Codebook Size

An important consideration in vector quantization is determining the appropriate codebook size to effectively capture global structural patterns. While increasing the number of codes might intuitively enhance representation capacity, empirical evidence shows this relationship is non-monotonic due to code under-utilization, where some codes remain unused during training. This phenomenon has been extensively studied in domains such as speech and computer vision (Yu et al., 2022; Zeghidour et al., 2021; Lee et al., 2022). To examine this trade-off, we conduct an ablation study measuring NMI scores for clustering across 1,050 D-MIPLIB instances. Following the setup in §4, we generate Forge embeddings with varying codebook sizes.

Table 1 reports our results showing no statistically significant difference in clustering performance, measured by NMI against the ground truth, across different codebook configurations. We attribute this stability to the clustering objective, which appears less sensitive to codebook size variations compared to downstream predictive tasks. This observation aligns with Yang et al. (2024), who reported consistent classification accuracy across various graph datasets until codebook sizes exceeded 16,000, at which point performance began to degrade. Based on these findings, we set the codebook size to 5,000 in our experiments (§5.1 and §5.2). While our pre-trained Forge model was trained on 2,850 MIP instances, making 5,000 codes potentially excessive, the framework is designed as a foundational architecture ready for extension to much larger datasets.

A.5 Vector Arithmetic in the Latent MIP Embedding Space

Given that we have embeddings for each MIP instance across various problem types, we ask if we can identify certain ‘directions’ in this latent optimization embedding space that could shift a MIP instance from one problem type to the other based on our theoretical understanding. This line of reasoning is inspired by the earlier work on understanding analogies in word embeddings, as in the famous $King-Man+Woman\approx Queen$ example (Ethayarajh et al., 2018).

Concretely, we inspect the relationship among the following covering and packing problems:

A.5.1 Set Cover Problem (SCP)

Given a universe of elements and a collection of subsets, find the smallest number of subsets that cover all elements.

[TABLE]

A.5.2 Vertex Cover Problem

Given a graph, find the smallest set of vertices such that every edge has at least one endpoint in the set.

[TABLE]

A.5.3 Bin Packing Problem

Given a set of items and a collection of bins with a capacity, find the smallest number of bins that pack all items within bin capacities.

[TABLE]

A.5.4 Independent Set Problem

Given a graph, find the largest set of vertices such that no two selected nodes are adjacent.

[TABLE]

Observation: In these problem definitions and formulations, notice how Set Cover and Vertex Cover are similar to each other as covering problems, and Bin Packing and Independent Set are similar to each other as packing problems. Conversely, the pairs are complementary to each other, e.g., the sizes of a minimum vertex cover and a maximum independent set sum to the number of vertices.

Research Question: We ask the following question: if we identify the difference in dimensions between covering and packing, and then push Minimum Vertex Cover instances along that direction, do we obtain embeddings that are closer to Maximum Independent Set instances? The intuition behind this is to remove the ‘cover’ aspect from Vertex Cover, and then add to that the ‘packing’ aspect of Bin Packing, to obtain Independent Set as a result.

To validate this experimentally, we fix the graph size to 1,000 vertices and select 50 random instances each for Set Cover, Vertex Cover, and Independent Set from D-MIPLIB (Huang et al., 2024), and 50 random Bin Packing instances from strIPlib (Bastubbe et al., 2025). We also control for the problem difficulty by ensuring all instances are solvable by Gurobi within 60 seconds.

Vector Arithmetic for Optimization: Next, we apply vector arithmetic on the embeddings of MIP instances to verify our intuition, as a (stretch) analogy to the famous $King-Man+Woman\approx Queen$ example, and test for $VertexCover-SetCover+BinPacking\approx IndependentSet$. We note that this analogy is not perfect, as there is no single ‘King’ instance in the optimization embedding space, but instead, we are averaging over multiple instances while controlling for the same graph size and similar difficulty across problems.

Given the embeddings of instances for Set Cover, $e_{sc}$, Bin Packing, $e_{bp}$, Vertex Cover, $e_{mvc}$, and Independent Set, $e_{mis}$, where $e\in\mathbb{R}^{\lvert instances\rvert\times\lvert codes\rvert}$, we compute the difference in dimensions between covering and packing, $d_{sc-bp}$, as follows:

[TABLE]

Given the difference between covering and packing, $d_{sc-bp}$ , we update the embedding of the instances of minimum vertex cover problem:

[TABLE]
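The two steps above can be sketched as follows. This is one plausible reading, not the exact formulas: it assumes the direction $d_{sc-bp}$ is the difference between the mean Set Cover embedding and the mean Bin Packing embedding, and that each Vertex Cover embedding is updated by subtracting that direction. All function names and the 2D toy vectors are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def shift_instances(e_source, e_remove, e_add):
    """Apply source - mean(remove) + mean(add) to every source embedding."""
    mu_remove, mu_add = mean_vector(e_remove), mean_vector(e_add)
    # Direction from 'add' to 'remove', e.g. covering minus packing.
    direction = [r - a for r, a in zip(mu_remove, mu_add)]
    return [[x - d for x, d in zip(row, direction)] for row in e_source]


def centroid_distance(rows, target_rows):
    """Euclidean distance between the centroids of two groups of embeddings."""
    a, b = mean_vector(rows), mean_vector(target_rows)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Made-up 2D embeddings standing in for the real code histograms.
e_sc = [[4.0, 0.0], [6.0, 0.0]]
e_bp = [[0.0, 4.0], [0.0, 6.0]]
e_mvc = [[5.0, 1.0], [7.0, 1.0]]
e_mis = [[1.0, 5.0], [1.0, 7.0]]

e_mvc_shifted = shift_instances(e_mvc, e_remove=e_sc, e_add=e_bp)
before = centroid_distance(e_mvc, e_mis)
after = centroid_distance(e_mvc_shifted, e_mis)
print(before, after)
```

Comparing the centroid distance before and after the shift gives a numeric counterpart to the visual check in the 2D projections.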

Results: Figure 9 visualizes these vector operations in the latent MIP embedding space using PaCMAP in Figure 9-A and PCA in Figure 9-B. In both visualizations, notice how the embeddings of updated Vertex Cover move closer to the embeddings of Independent Set compared to their initial starting point, once they are modified by the direction obtained from the difference of embeddings of covering and packing instances.

Similarly, we further examine the vector arithmetic results for:

• In Figure 10 above, $Set\ Cover-Vertex\ Cover+Independent\ Set=Bin\ Packing$: Intuitively, this can be understood as removing the ‘cover’ direction from Set Cover by subtracting Vertex Cover and adding the ‘packing’ direction from Independent Set to push the updated Set Cover instances closer to the Bin Packing instances than the initial Set Cover instances.

• In Figure 11 below, $Independent\ Set-Bin\ Packing+Set\ Cover=Vertex\ Cover$: Intuitively, this can be understood as removing the ‘packing’ direction from Independent Set by subtracting Bin Packing and adding the ‘cover’ direction from Set Cover to push the updated Independent Set instances closer to the Vertex Cover instances than the initial Independent Set instances.

A.6 Details of Our Experimental Setup

For the experiments presented in §5, we train on the ml.g5.xlarge AWS instance with a GPU with 24 GB of memory. Inference experiments were run on the ml.c5.12xlarge instance with 48 cores and 96 GB of RAM. To ensure consistency and fairness, all experiments were executed with Gurobi (v12.0.3), a state-of-the-art commercial MIP solver (Gurobi Optimization, LLC, 2024), restricted to a single thread and a time limit of 3600 seconds. Unsupervised pre-training and integrality gap fine-tuning are run for 10 epochs with a learning rate of $10^{-4}$, while the search guidance prediction task is trained for 25 epochs using a learning rate of $10^{-5}$. The fixed radius in search guidance is set to 0.1.

A.7 Details of the Triplet Loss

Recall that Forge generates embedding vectors per instance and per variable/constraint. A straightforward method of predicting which variables are likely to be part of the solution is to treat it as a binary classification problem and use binary cross entropy (BCE) loss. However, this poses challenges due to the large class imbalance where most variables are not part of the solution.

Triplet Construction: To address this issue, we generate 5 feasible solutions for each instance. Variables in the instance are grouped by how many of the 5 solutions each variable appears in. A variable is considered a ‘negative’ variable if and only if it appears in none of the 5 solutions. Next, we use the standard triplet loss (Schroff et al., 2015) to fine-tune the embeddings of the variables. The triplet loss is given by:

[TABLE]

Here, $a$ is the ‘anchor’ node, $p$ is the ‘positive’ node, $n$ is the ‘negative’ node, and $d$ is a distance function (Euclidean distance in our case). Triplet loss aims to minimize the distance between the ‘anchor’ and ‘positive’ nodes while ensuring the ‘negative’ node is at least ‘margin’ distance away from the ‘anchor’ node (margin is set to 2 in our case). In our setting, as shown in Figure 3-C, all variables appearing in the same number of solutions are treated as ‘positive’ and ‘anchor’ pairs. A key challenge of triplet loss is finding good negative nodes, as picking trivially negative nodes does not aid learning. We pick variables that have not appeared in any solution but are closest to the positive node in the unsupervised embedding space as negative nodes. Finally, the Forge model is fine-tuned to predict warm-start variables using a combination of triplet loss and BCE loss.
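A minimal sketch of the standard triplet loss (Schroff et al., 2015) with Euclidean distance and margin 2, together with the hard-negative selection described above. The function names and toy coordinates are hypothetical, and embeddings are plain tuples here; in practice the loss would be computed on model outputs inside a differentiable framework.

```python
import math


def triplet_loss(anchor, positive, negative, margin=2.0):
    """Standard triplet loss: max(d(a, p) - d(a, n) + margin, 0), with Euclidean d."""
    return max(math.dist(anchor, positive) - math.dist(anchor, negative) + margin, 0.0)


def hardest_negative(reference, negatives):
    """Index of the negative candidate closest to the reference embedding."""
    return min(range(len(negatives)), key=lambda i: math.dist(reference, negatives[i]))


# Made-up 2D embeddings: one anchor/positive pair and two negative candidates.
anchor, positive = (0.0, 0.0), (1.0, 0.0)
candidates = [(5.0, 0.0), (2.0, 0.0)]

idx = hardest_negative(positive, candidates)
hard = triplet_loss(anchor, positive, candidates[idx])
easy = triplet_loss(anchor, positive, candidates[0])
print(idx, hard, easy)
```

The far candidate already satisfies the margin and yields zero loss, so it contributes no gradient; the nearby candidate yields a positive loss, which is why hard negatives are preferred.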

A.8 Ablation Study on Search Guidance

In §5.2, as part of our comparison with state-of-the-art ML methods, we augmented PS-Gurobi (Han et al., 2023) with Forge embeddings and observed performance gains from their combination. More precisely, in PS-Gurobi, the input node features for constraints and variables are 4 and 6 dimensions, respectively. To minimize architectural changes, we applied Principal Component Analysis (PCA) to reduce the dimensionality of Forge embeddings from 1024 down to 64. This raises an important question: how much of the improvement is due to Forge versus the increased model capacity from higher-dimensional inputs? To control for this effect, we repeated the experiments from §5.2, augmenting PS-Gurobi with 64-dimensional random vectors. This comparison isolates the contribution of our pre-trained embeddings to overall performance. Each PS-Gurobi model for each problem type and augmentation was trained from scratch.

Figure 12 compares PS-Gurobi, PS-Gurobi augmented with Forge, and PS-Gurobi with random vectors. Our ablation shows that adding random vectors occasionally improves performance, likely due to the increase in input dimensionality, from 4 and 6 features for constraints and variables to 68 and 70, resulting in greater model capacity. However, semantic embeddings from Forge consistently dominate, delivering the strongest performance across both problem types.

A.9 Additional Related Work

Recent advances in applying machine learning to combinatorial optimization, particularly Mixed-Integer Programming (MIP), have led to a wealth of literature. These methods can be broadly categorized based on their focus areas, such as solver configuration, branching strategies, heuristic design, and generalization across problem types.

Learning-based Methods to Enhance MIP Solvers: Several works have explored integrating supervised and reinforcement learning into traditional MIP solving pipelines. The survey by Zhang et al. (2023) categorizes these methods into two main groups: those enhancing the Branch-and-Bound process (e.g., branching variable prediction (Kadioglu et al., 2012; Liberto et al., 2016; He et al., 2014b; Lodi & Zarpellon, 2017; Bengio et al., 2021a), cutting plane selection (Tang et al., 2020; Paulus et al., 2022), node selection (He et al., 2014a)), and those improving heuristic algorithms such as Large Neighborhood Search, Feasibility Pump, and Predict-and-Pick. These models are typically trained end-to-end, with reinforcement learning often relying on imitation learning. We refer to (Gast Zepeda et al., 2025) for a recent survey.

In the context of solver configuration, the earlier work of Kadioglu et al. (2010) propose the ISAC framework that takes a clustering-based approach to instance-specific algorithm configuration. It uses g-means to cluster problem instances and assigns configurations based on cluster membership. The method considers domain-specific features such as cost-density ratios and root cost metrics for problems such as SCP, MIP, and SAT. Algorithm selection, scheduling and portfolios have also been studied (Xu et al., 2008; Kadioglu et al., 2011a; Schede et al., 2025; Kemminer et al., 2024). More recently, Hosny & Reda (2024) propose predicting solver parameters by leveraging similarities between problem instances. Their key assumption is that instances with similar costs under one configuration will behave similarly under other configurations. Their features include pre-solve statistics and tree-based metrics, with a triplet loss guiding the learning process using solved instance objectives.

To improve generalization, Boisvert et al. (2024) propose a generic representation for combinatorial problems using abstract syntax trees with five node types - variables, constraints, values, operators and a model node. While expressive, this approach results in large, computationally expensive graphs.

Multi-Task and Generalist Models: Similar to our work here, efforts to unify learning across tasks and problem types have also emerged. Cai et al. (2025a) introduce a multi-task representation learning framework for MIP, training a shared backbone across tasks such as backdoor prediction and solver configuration prediction, followed by fine-tuning for specific problem types. Their method uses a bipartite graph representation, Graph Attention Networks (GAT), and contrastive loss, and is evaluated on problems such as CA, MVC, and MIS. Similarly, Drakulic et al. (2024) present GOAL, a generalist agent for combinatorial optimization. It avoids GNNs, instead using mixed attention over edge and node matrices derived from bipartite graphs. Li et al. (2025) propose an LLM-based evolutionary framework that can generate a large set of diverse MIP classes and can be fine-tuned to predict integrality gaps and branching nodes.

Graph Neural Networks for Branching and Heuristics: Graph-based representations have become standard for encoding MIP instances. One of the earliest works on graph-based learning, by Gasse et al. (2019), uses GNNs to learn strong branching policies, introducing a bipartite graph structure and dual half-convolutions for message passing between constraints and variables. Chen et al. (2024) revisit GNNs for MIPs and show that higher-order GNNs can overcome limitations identified via the 1-Weisfeiler-Lehman test, making all instances tractable for message passing. Cantürk et al. (2024) introduce improvements to the standard GNN workflow for CO so that models generalize to instances of a larger scale than those used in training, and propose a two-stage primal heuristic strategy based on uncertainty quantification to automatically configure how solution search relies on the predicted decision values.

Along the lines of backdoor learning, Cai et al. (2024a) use Monte Carlo Tree Search to identify effective backdoors, training a GAT to score variables. Ferber et al. (2022) propose pseudo-backdoors, using one model that characterizes whether a subset of variables is a good backdoor and another model to predict whether prioritizing this subset would lead to a smaller run time.

Learning Heuristics and Large Neighborhood Search (LNS): Earlier works such as (Kadioglu et al., 2017; 2011b) propose learning reactive restart and impact-based strategies to improve search. Recently, a growing body of work focuses on learning heuristics, particularly for Large Neighborhood Search (LNS). Huang et al. (2023) use expert heuristics to create training data followed by random perturbations to create ‘negative’ samples. Then contrastive learning is used to train GATs to predict node probabilities. Other works by Wu et al. (2021) and Song et al. (2020) use deep reinforcement learning to learn destroy operators or decompositions, with rewards based on objective improvements. Khalil et al. (2017) model the success of heuristics at specific nodes by examining instance-based characteristics and use logistic regression over a rich feature set, including LP relaxation and scoring metrics. Cai et al. (2025b) propose Balans, an adaptive meta-solver for MIPs with online learning capability that does not require any supervision or a priori training. During the search, the selection among different neighborhood definitions is guided on the fly via multi-armed bandit algorithms (Strong et al., 2021; 2019). Yilmaz et al. (2025) extend Balans (Cai et al., 2024b) using solver- and algorithmic-level parallelism into ParBalans to improve performance on challenging MIP instances.

Problem Specific Methods: The vehicle routing problem (VRP) has garnered special attention from the community (Berto et al., 2025; Hottung & Tierney, 2022). Zhou et al. (2023) introduce a meta-learning framework for VRPs, enabling generalization across problem sizes and distributions. Berto et al. (2024) explore ML solutions for different kinds of VRPs, such as those including backhauls, multi-depots, duration limits, mixed backhauls, and line hauls, among others. They use a common encoder for all VRP types, with global attributes for the problem type and local node attributes to capture customer-specific attributes such as location and demands.

Another problem that has garnered attention is the constraint satisfaction problem. Tönshoff et al. (2023) use GNNs to predict soft assignments, with reinforcement learning rewards based on constraint satisfaction improvements. Duan et al. (2022) propose a contrastive learning framework that generates label-preserving augmentations for SAT problems. These include techniques such as unit clause propagation, pure literal elimination, and clause resolution, ensuring that the satisfiability of the instance remains unchanged while enhancing the model’s robustness. Shafi et al. (2025) introduce Graph-SCP, a method that leverages features extracted from both bipartite and hypergraph representations of set cover problem (SCP) instances. A GNN is then trained with these features to predict a promising subproblem where the optimal solution is likely to reside. This predicted subproblem is passed to a solver, effectively accelerating the overall solution process.
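
As an illustration of what a label-preserving SAT augmentation looks like, here is a minimal sketch of a single pass of pure literal elimination, one of the techniques named above. It assumes a CNF formula stored as a list of clauses with DIMACS-style signed integers; the function name is hypothetical and the code is not taken from Duan et al. (2022).

```python
def pure_literal_elimination(clauses):
    """Apply one pass of pure literal elimination to a CNF formula.

    A literal is pure if its negation never appears in the formula.
    Every clause containing a pure literal can be dropped without
    changing satisfiability, because setting that literal to true
    satisfies the clause and cannot falsify any other clause.
    """
    literals = {lit for clause in clauses for lit in clause}
    pure = {lit for lit in literals if -lit not in literals}
    return [clause for clause in clauses
            if not any(lit in pure for lit in clause)]

# (x1 or x2) and (not x2 or x3) and (not x3 or not x2)
formula = [[1, 2], [-2, 3], [-3, -2]]
augmented = pure_literal_elimination(formula)
```

Here `x1` is pure, so the first clause is removed and `augmented` is an equisatisfiable variant of `formula`. Only one pass is performed; literals that become pure after the removal (such as `-2` in this example) are left for a subsequent pass.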

Unsupervised Approaches: Unsupervised learning has also been explored in various forms. Karalias & Loukas (2020) introduce a framework that learns a probability distribution over nodes, optimizing a loss that bounds the probability of finding a solution. These distributions are then decoded using a derandomization process. Bu et al. (2024) build on the work by Karalias & Loukas (2020) by formalizing objective construction and derandomization strategies. They derive explicit formulations tailored to a range of combinatorial problems, including facility location, maximum coverage, and robust graph coloring. Sanokowski et al. (2024) provide an approach for solving combinatorial optimization problems without labeled data by leveraging diffusion models to sample from complex discrete distributions. Their method avoids the need for exact likelihoods by optimizing a loss that upper bounds the reverse KL divergence. While Forge falls in the domain of unsupervised approaches, we differ in that our goal is not to solve a given instance in an unsupervised manner, but rather to learn the graphical structure of various MIP instances in an unsupervised manner, followed by supervised fine-tuning to aid in finding the solution.

Bibliography

1. Bastubbe et al. (2025) Michael Bastubbe, Alexander Helber, Lukas Kirchhart, Marco Lübbecke, Niklas Rieken, and Jonas Witt. strIPlib: Structured Integer Programming Library. Unpublished, 2025. URL https://striplib.or.rwth-aachen.de.
2. Bengio et al. (2021a) Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: A methodological tour d’horizon. European Journal of Operational Research, 290(2):405–421, 2021a. ISSN 0377-2217. doi: https://doi.org/10.1016/j.ejor.2020.07.063. URL https://www.sciencedirect.com/science/article/pii/S0377221720306895.
3. Bengio et al. (2021b) Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. European Journal of Operational Research, 290(2):405–421, 2021b.
4. Berto et al. (2024) Federico Berto, Chuanbo Hua, Nayeli Gast Zepeda, André Hottung, Niels Wouda, Leon Lan, Kevin Tierney, and Jinkyoo Park. Routefinder: Towards foundation models for vehicle routing problems. In ICML 2024 Workshop on Foundation Models in the Wild, 2024.
5. Berto et al. (2025) Federico Berto, Chuanbo Hua, Nayeli Gast Zepeda, André Hottung, Niels Wouda, Leon Lan, Junyoung Park, Kevin Tierney, and Jinkyoo Park. Routefinder: Towards foundation models for vehicle routing problems, 2025. URL https://arxiv.org/abs/2406.15007.
6. Boisvert et al. (2024) Léo Boisvert, Hélène Verhaeghe, and Quentin Cappart. Towards a generic representation of combinatorial problems for learning-based approaches. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pp. 99–108, 2024.
7. Bu et al. (2024) Fanchen Bu, Hyeonsoo Jo, Soo Yong Lee, Sungsoo Ahn, and Kijung Shin. Tackling prevalent conditions in unsupervised combinatorial optimization: Cardinality, minimum, covering, and more. In International Conference on Machine Learning, pp. 4696–4729, 2024.
8. Cai et al. (2024a) Junyang Cai, Taoan Huang, and Bistra Dilkina. Learning backdoors for mixed integer linear programs with contrastive learning. In European Conference on Artificial Intelligence, volume 392, pp. 2418–2425, 2024a.