Mixed Integer Formulations
In this page, we give a quick overview of the mixed-integer formulations used to represent the various machine learning (ML) models supported by the package.
Throughout, we denote by \(x\) the input to the ML model (i.e. the independent variables) and by \(y\) the output of the ML model (i.e. the dependent variables).
For all examples we will consider only a single sample as input. That sample depending on the model will have multiple features and multiple outputs, being either continuous (regression) or discrete (classification). Inputting more samples, which can be done in all cases by this package, creates the same set of constraints per sample.
Argmax Formulation
These formulations are used in most cases of classification, i.e., when the output needs to be categorical. Different formulations are used depending on the dimension of the output \(y\). For all formulations of argmax there are potential issues when multiple classes have approximately the largest value. For these cases we will use \(\epsilon\) to explain when issues can arise, where by default in SCIP \(\epsilon = 10^{-6}\). Unlike standard argmax, instead of returning the index of the largest argument, we want the index of the largest argument to take value 1 and all other indices to take value 0. We denote \(y'\) as the output of the ML model before argmax has been applied.
In the scenario where \(y\) is single dimensional (which normally makes an argmax redundant), we interpret argmax as checking whether the value is greater than 0.5 or not.
This formulation has an issue of tolerances around 0.5. So values in the range \([0.5 - \epsilon, 0.5 + \epsilon]\) can all be either 0 or 1.
In the scenario where \(y\) is two-dimensional, we formulate argmax with indicator constraints
This formulation also has an issue with tolerances. If both \(y'_0\) and \(y'_1\) take values differing by less than \(\epsilon\), then either \(y_0\) or \(y_1\) can be selected by SCIP.
In the scenario where \(y\) is three-dimensional or more, we formulate argmax with SOS constraints. Let \(m \in \mathbb R\) be an additional variable used to find the maximum value, and let \(s \in \mathbb{R}^{n}_{\geq 0}\) be additional slack variables.
This formulation has an issue with tolerances when there are classes with output values less than \(\epsilon\) smaller than the largest output value. In such a case SCIP can select either the largest value label or any of those \(\epsilon\) close class labels as argmax.
For ease of notation, this constraint will simply be referred to as argmax(\(y'\))
Linear Regression
Denoting by \(\beta \in \mathbb R^{n+1}\) the computed weights of linear regression, its model takes the form
Since this is linear, it can be represented directly in SCIP using linear constraints. Note that the model fits other techniques such as Ridge, Lasso, and ElasticNet.
Logistic Regression
The standard logistic function, also referred to as the sigmoid function in some communities, is \(f(x) = \frac{1}{1 + e^{-x}}\).
In the case of regression, and a single dimensional output, the logistic regression formulation is
In the case of regression with multi-dimensional output, the regression formulation depends on scikit-learn. This can change depending on user defined parameters within the framework. The two potential formulations are
In the case of classification we avoid modelling the non-linearities. For \(y\) being single dimension we formulate logistic regression for classification with
In the case of multi-class classification we formulate logistic regression using the argmax formulation
Neural Networks
The package currently models dense neural network with ReLU, Sigmoid, and Tanh activations. For all formulations we let i be the node index of the input layer and j be node index of the out layer.
For dense layers with a ReLU activation function, we introduce slack variables \(s \in \mathbb{R}^{n}_{\geq 0}\), with the formulation of the layer given by:
Note that this formulation is non-standard in the literature. The standard formulation uses big-M constraints, for which bounds are found through feasibility and optimality based bound tightening procedures. Empirically such formulations have been shown to be the current state-of-the-art. Such a formulation fails completely, however, when the big-M values becomes sufficiently large, and is less friendly numerically overall w.r.t. the difference between the true output of the predictor and that which SCIP returns. Therefore we have decided to use SOS1 constraints, which will likely be slower on well-scaled neural networks.
An option to use the traditional big-M formulation is provided if one embeds the neural network predictor with the argument formulation=bigm. Warning: When using this formulation, the user must provide appropriate input bounds. The formulation for the big-M model introduces binary variables \(z \in \{0,1\}^{n}\), and performs a propagation routine to obtain lower and upper bound of the output neurons denoted by \(L, U \in \mathbb{R}^{n}\). The formulation of the layer is given by:
For dense layers with a Sigmoid activation function the formulation is:
For dense layers with a Tanh activation function the formulation is:
For dense layers with a Softmax activation function the formulation is:
For dense layers with a Softplus activation function the formulation is:
As the maximum is preserved over all these activation functions, and other activation functions such as Softmax, the inserted predictor constraint for classification purposes does not explicitly model the final activation layer. In such a case the formulation used is:
Decision Tree
In a decision tree, each leaf \(l\) is defined by a number of constraints on the input features of the tree that correspond to the branches taken in the path leading to \(l\). We formulate decision trees by introducing one binary decision variable \(\delta_l\) for each leaf of the tree.
In the decision tree exactly one leaf is chosen. This constraint is formulated as:
To ensure that the input vector maps to the correct leaf, however, we need to introduce additional notation and constraints. For a node \(v\), we denote by \(i_v\) the feature used for splitting and by \(\theta_v\) the value at which the split is made. At a leaf \(l\) of the tree, we have a set \(\mathcal L_l\) of inequalities of the form \(x_{i_v} \le \theta_v\) corresponding to the left branches leading to \(l\) and a set \(\mathcal R_l\) of inequalities of the form \(x_{i_v} > \theta_v\) corresponding to the right branches.
For each leaf, the inequalities describing \(\mathcal L_l\) and \(\mathcal R_l\) are imposed using indicator constraints:
In our implementation, \(\epsilon\) can be specified by a keyword parameter epsilon in functions that add a decision tree constraint. By default the value for \(\epsilon\) is 0. When \(\epsilon\) is smaller than the default tolerance in SCIP (as it is by default), and you have a solution where \(x_{i_v} \approx \theta_v\), then SCIP can select an arbitrary child node of that decision in the tree.
Here is a concrete example. Let an internal node of the decision tree be for the feature \(x_4\) and value 5. Then the decisions are:
When \(x_4 \approx 5\), both these conditions are true for SCIP, and therefore both child nodes can be reached. The result is that for a value of \(x_4 = 4.999999999\), SCIP could say that \(x_4 \geq 5\), and then the output of SCIP can be drastically different to that which is returned by decision_tree.predict(). The purpose of \(\epsilon\) is to break these ties, and enforce that only one of the decision can ever be true. The downside is that it introduces a small area of model infeasibility. For instance, if \(\epsilon = 0.001\), and the only solution to the above example is \(x_4 = 5\), then that solution is no longer valid according to the formulation. Therefore, we warn users to be careful when setting \(\epsilon\) to be non-zero.
When using decision trees for classification, we create constraints that ensure the correct class is selected depending on the leaf node. Let \(y_j\) be the output for class j, and \(L_j\) be the set of leaf nodes that predict class j. The constraint ensuring the class is selected according to the leaf node is:
Random Forests
The formulation of Random Forests is a linear combination (aggregation) of decision trees. Each decision tree is represented using the model above. The same difficulties with the choice of \(\epsilon\) apply to this case.
In the case of classification, after the linear combination (aggregation) is performed, the output is piped through the argmax formulation.
Gradient Boosting Trees
The formulation of Gradient Boosting Trees is a linear combination (aggregation) of decision trees. Each decision tree is represented using the model above. The same difficulties with the choice of \(\epsilon\) apply to this case.
In the case of classification, after the linear combination (aggregation) is performed, the output is piped through the argmax formulation.
Support Vector Machines
For support vector machines, currently only linear and polynomial kernels are supported. In addition, currently only binary classification is supported, as modelling the sklearn one-vs-all relation is non-trivial.
The formulation for a linear kernel is simply a linear regression model. That is:
The formulation for a polynomial kernel requires introducing some additional notation. Let \(S\) be the set of support vectors, \(d\) be the degree of the polynomial kernel, \(\Lambda \in \mathbb{R}^{S}\) be the dual coefficients, and \(v \in \mathbb{R}^{n \times S}\) be the support vectors. Then the result of the regression function is:
As \(|S|\) is not known until after training, it is possible that the embedded models for polynomial kernels can be much larger than the feature sizes would suggest. Hence, the MIP model may become quite large and difficult to optimise.
In the case of classification, the output of the regression function is piped into the argmax formulation centred around 0.
Centroid Clustering
The formulation for centroid clustering is as follows, where \(d_k\) are the distance variables between the input vector and cluster \(k\) and \(C \in \mathbb{R}^{n \times K}\) are the cluster centers:
The argmin function in practice is accomplished by using argmax on the negative variables.
The above formulation is non-linear. A linearised version that uses the L1 norm instead of the L2 norm is also provided, although it should be noted that points can be misclassified when using this formulation as it is an approximation. Let \(pos_{i,k} \geq 0\) and \(neg_{i,k} \geq 0\) be variables whose sum is the L1 distance in a dimension to a given centroid.