Instability and the Geometry of Optimization Problems#

As we have seen, whenever we solve a problem numerically, we have to accept that the input we provide and the output we obtain may differ from the theoretical or mathematical solution to the given problem. For example, \(0.1\), in a computer, will be represented by a number that differs from \(0.1\) by about \(10^{-17}\). Thus, a natural thing to worry about is whether these small differences can induce large differences in the computed solution.
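A quick way to see this representation error, using only the Python standard library, is to compare the binary double stored for `0.1` against the exact decimal value (a small illustrative check, not part of any solver):

```python
from decimal import Decimal

# Decimal(0.1) exposes the exact value of the binary double stored for 0.1,
# while Decimal("0.1") is the exact decimal number 1/10.
stored = Decimal(0.1)
exact = Decimal("0.1")
err = abs(stored - exact)
print(err)  # roughly 5.5e-18, i.e. on the order of 10^-17
```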

This is the idea behind the notion of the Condition Number for a given problem. While it is true that for most practical optimization problems, small perturbations in the input only induce small perturbations in the final answer to the problem, there are some special situations where this is not the case. These ill-behaved problems are called Ill Conditioned or Numerically Unstable.

This section aims to show, in the context of linear optimization problems, the most common sources of this behavior, and also how to avoid it altogether. We first review the problem of solving linear systems with unique solutions, then move to the more central issue of linear optimization problems and their geometric interpretation, and then describe some of the most common bad cases. We then provide two thought experiments with interactive material to help illustrate the concepts of this section. We conclude with some further thoughts on this topic.

Note that although the notion of the Condition Number has received a lot of attention from the academic community, reviewing this literature is beyond the scope of this document. If you want to start looking into this topic, a good entry point can be the Condition Number page at Wikipedia.

The Case of Linear Systems#

Solving linear systems is a very common subroutine in any MI(QC)P solver, as many linear systems must be solved during the full execution of the algorithm.

So, suppose we have a linear system \(Ax=b\) with a unique solution (i.e. \(A\) is a square matrix with full rank), and we want to evaluate how the solution to the system might change if we perturb the right-hand side \(b\). Since the system has a unique solution, we know that given \(b\), the solution will be \(A^{-1}b\), and if we perturb \(b\) with \(\varepsilon\), the solution will be \(A^{-1}(b+\varepsilon)\). A measure for the relative change in the solution with respect to the relative change in the input is the ratio

\[\eta(b,\varepsilon):=\frac{\|A^{-1}b\|}{\|A^{-1}(b+\varepsilon)\|}/\frac{\|b\|}{\|b+\varepsilon\|}.\]

Note that the above definition is independent of the magnitudes of \(b\) and \(\varepsilon\). From there, the worst possible ratio would be the result of

\[\kappa(A):=\max_{b,\varepsilon}\eta(b,\varepsilon).\]

This quantity is known as the condition number of the matrix \(A\), and one can show that it equals

\[\kappa(A)={\|A\|}{\|A^{-1}\|}.\]

A common interpretation of \(\kappa(A)=10^k\) is that, when solving the system \(Ax=b\), you may lose up to \(k\) digits of accuracy in \(x\) from the accuracy you have in \(b\).
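As an illustration (a self-contained sketch with made-up data, not solver output), the following computes \(\kappa\) in the \(\infty\)-norm for a nearly singular \(2\times 2\) matrix and shows the amplification directly:

```python
# A nearly singular 2x2 system: kappa is ~4e4, so ~4-5 digits can be lost.
A = [[1.0, 1.0],
     [1.0, 1.0001]]

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def solve2(M, rhs):
    # Cramer's rule for a 2x2 system M x = rhs
    d = det2(M)
    return [(rhs[0] * M[1][1] - rhs[1] * M[0][1]) / d,
            (M[0][0] * rhs[1] - M[1][0] * rhs[0]) / d]

def norm_inf(M):
    # infinity norm: maximum absolute row sum
    return max(sum(abs(v) for v in row) for row in M)

d = det2(A)
Ainv = [[ A[1][1] / d, -A[0][1] / d],
        [-A[1][0] / d,  A[0][0] / d]]
kappa = norm_inf(A) * norm_inf(Ainv)   # ~4e4

x  = solve2(A, [2.0, 2.0001])          # exact solution is (1, 1)
xp = solve2(A, [2.0, 2.0002])          # rhs moved by 1e-4 -> solution (0, 2)
```

A relative change of about \(5\cdot 10^{-5}\) in \(b\) moved the solution from \((1,1)\) to \((0,2)\), i.e. a relative change of order one: an amplification of roughly \(\kappa\), consistent with the digits-of-accuracy interpretation above.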

The condition number for the optimal simplex basis in an LP is captured in the KappaExact attribute. A very large \(\kappa\) value can be an indication that the computed solution is unstable.

When this is indeed the case, the best advice is to scale the constraint matrix coefficients so that the resulting range of coefficients is small. This transformation will typically reduce the \(\kappa\) value of the final basis; please refer to the Tolerances and User-Scaling section for a discussion on how to perform this rescaling, and also for caveats on scaling in general.
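A minimal sketch of the idea behind such rescaling (hypothetical diagonal data, not a Gurobi API call): scaling a row with large coefficients, together with its right-hand side, can shrink \(\kappa\) dramatically without changing the solution.

```python
def norm_inf(M):
    # infinity norm: maximum absolute row sum
    return max(sum(abs(v) for v in row) for row in M)

# Badly scaled system: 1e6*x = 1e6, y = 1 (solution x = y = 1)
A = [[1e6, 0.0], [0.0, 1.0]]
Ainv = [[1e-6, 0.0], [0.0, 1.0]]             # inverse of a diagonal matrix
kappa_before = norm_inf(A) * norm_inf(Ainv)  # 1e6

# Scale the first row (and its rhs) by 1e-6: the system becomes x = 1, y = 1
S = [[1e-6 * v for v in A[0]], A[1]]
Sinv = [[1.0 / S[0][0], 0.0], [0.0, 1.0]]
kappa_after = norm_inf(S) * norm_inf(Sinv)   # ~1.0
```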

The Geometry of Linear Optimization Problems#

Before showing optimization models that exhibit bad behavior, we first need to understand the geometry behind them. Consider a problem of the form

\[\begin{split}\begin{array}{ll} \max & cx\\ s.t. & Ax \leq b.\\ \end{array}\end{split}\]

For example:

\[\begin{split}\begin{array}{lrrl} \max & x + y & \vec{c}=&(1,1)\\ s.t. & -x \leq 0 & A_{1\cdot}=&(-1,0)\\ & x \leq 1 & A_{2\cdot}=&(1,0)\\ & -y \leq 0 & A_{3\cdot}=&(0,-1)\\ & y \leq 1 & A_{4\cdot}=&(0,1).\\ \end{array}\end{split}\]

Note that if we denote \(b^t:=(0,1,0,1)\), then the problem can be stated as

\[\max_{x\in\mathbb{R}^2}\{ \vec{c}x:Ax\leq b\}.\]

The feasible region, direction of improvement \(\vec{c}\), and optimal solution \(x^*\) can be depicted as

../../_images/codedraw3.svg
../../_images/codedraw3-dark.svg

Note that whenever we move in the direction of \(\vec{c}\), the value \(\vec{c}x\) increases. Furthermore, since we cannot move from \(x^*\) to another feasible point with better objective value, we can conclude that \(x^*\) is indeed the optimal solution for the problem. Note that \(x^*\) is a corner point of the feasible region. This is not a coincidence; you will always find an optimal solution at a corner point if the feasible region is bounded and \(\vec{c}\) is not zero. If the objective is zero then all feasible solutions are optimal; we will talk more about zero objectives and their implications later.
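The corner-point claim can be verified for this example by brute force (a sketch that simply enumerates all pairwise constraint intersections, which is fine in two dimensions):

```python
from itertools import combinations

# Feasible region of the example: the unit square, as rows of A and b
A = [(-1.0, 0.0), (1.0, 0.0), (0.0, -1.0), (0.0, 1.0)]
b = [0.0, 1.0, 0.0, 1.0]
c = (1.0, 1.0)

def intersect(i, j):
    # solve the 2x2 system A[i] p = b[i], A[j] p = b[j]
    (a1, a2), (a3, a4) = A[i], A[j]
    det = a1 * a4 - a2 * a3
    if abs(det) < 1e-12:
        return None                      # parallel constraints
    return ((b[i] * a4 - a2 * b[j]) / det,
            (a1 * b[j] - b[i] * a3) / det)

def feasible(p):
    return all(ai * p[0] + aj * p[1] <= bi + 1e-9
               for (ai, aj), bi in zip(A, b))

corners = []
for i, j in combinations(range(len(A)), 2):
    p = intersect(i, j)
    if p is not None and feasible(p):
        corners.append(p)

# the optimum over the polytope is attained at the best corner
best = max(corners, key=lambda p: c[0] * p[0] + c[1] * p[1])
# best is (1.0, 1.0), the corner x* in the figure
```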

To understand how changes in the input data affect the feasible region and the optimal solution, consider a small modification: \(\tilde{b}^t=(\varepsilon,1,0,1)\), \(\tilde{\vec{c}}=(1+\varepsilon,1)\), and \(\tilde{A_{4\cdot}}=(\varepsilon,1)\). Then our optimization problem would look like

../../_images/codedraw4.svg
../../_images/codedraw4-dark.svg

Note that although we changed the right-hand side, this change had no effect on the optimal solution to the problem, but it did change the feasible region by enlarging the bottom part of the feasible area.

Changing the objective vector tilts the corresponding vector in the graphical representation. This of course also changes the optimal objective value. Perturbing a constraint tilts the graphical representation of the constraint. The change in \(A_{4\cdot}\) changes the primal solution itself. The amount of tilting a constraint undergoes depends on the relative value of the perturbation. For example, although the constraint \(x \leq 1\) and the constraint \(100 x \leq 100\) induce the same feasible region, the perturbation \(x + \varepsilon y\leq 1\) will induce more tilting than the perturbation \(100 x + \varepsilon y \leq 100\).
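To quantify the tilting claim, one can compare the angle by which each constraint's normal vector rotates under the same absolute perturbation of the \(y\)-coefficient (a small illustrative computation, not solver behavior):

```python
import math

def tilt_degrees(a, pert):
    # angle between the original normal (a, 0) and the perturbed normal (a, pert)
    return math.degrees(math.atan2(pert, a))

eps = 1e-2
small = tilt_degrees(1.0, eps)     # x + eps*y <= 1      : tilts ~0.57 degrees
large = tilt_degrees(100.0, eps)   # 100x + eps*y <= 100 : tilts ~0.0057 degrees
# same perturbation eps, about 100x less tilting for the larger coefficients
```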

Multiple Optimal Solutions#

A common misconception among beginners in optimization is the idea that optimization problems really have just one solution. Surprisingly, this is typically not true. For many practical problems, the objective (whether it is cost or revenue or …) is dominated by a handful of variables, while most variables are just there to ensure that the actual operation of the solution is possible. Consider a staffing problem, for example, where cost is typically driven by the number of people who work on a given day, not by the specific people.

These kinds of situations naturally lead to problems similar to

\[\begin{split}\begin{array}{lrrl} \max & y & \vec{c}=&(0,1)\\ s.t. & -x \leq 0 & A_{1\cdot}=&(-1,0)\\ & x \leq 1 & A_{2\cdot}=&(1,0)\\ & -y \leq 0 & A_{3\cdot}=&(0,-1)\\ & y \leq 1 & A_{4\cdot}=&(0,1).\\ \end{array}\end{split}\]

Graphically this can be depicted as

../../_images/codedraw5.svg
../../_images/codedraw5-dark.svg

In this situation it is clear that \(x^1\), \(x^3\), and all solutions lying on the line between these two points are optimal. The simplex algorithm will return either \(x^1\) or \(x^3\) (and may switch if you change parameters). The barrier algorithm (without crossover) will return \(x^2\). These solutions are all correct; the problem as stated has no reason to prefer one over the other. If you do have a preference, you’ll need to state it in your objective function.
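The optimal face can be checked directly: with this objective, two corners of the square tie, and every convex combination of them is also optimal (a quick sketch):

```python
# Corners of the unit-square feasible region above
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
c = (0.0, 1.0)   # objective: max y

def obj(p):
    return c[0] * p[0] + c[1] * p[1]

best_val = max(obj(p) for p in corners)
optima = [p for p in corners if obj(p) == best_val]   # (0,1) and (1,1)

# any point on the segment between the two optimal corners is optimal too
t = 0.25
blend = ((1 - t) * optima[0][0] + t * optima[1][0],
         (1 - t) * optima[0][1] + t * optima[1][1])
```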

Dealing with Epsilon-Optimal Solutions#

The previous section considered the case of multiple (true) optimal solutions. What happens when we have several \(\varepsilon\)-optimal solutions? To be more specific, consider

\[\begin{split}\begin{array}{lrrl} \max & \varepsilon x + y & \vec{c}=&(\varepsilon,1)\\ s.t. & -x \leq 0 & A_{1\cdot}=&(-1,0)\\ & x \leq 1 & A_{2\cdot}=&(1,0)\\ & -y \leq 0 & A_{3\cdot}=&(0,-1)\\ & y \leq 1 & A_{4\cdot}=&(0,1).\\ \end{array}\end{split}\]

Graphically this can be depicted as

../../_images/codedraw6.svg
../../_images/codedraw6-dark.svg

If \(\varepsilon\) is zero, then we are in the situation described before. Note, however, that a small perturbation of the objective vector may lead to either \(x^1\) or \(x^2\) being reported as optimal. And tolerances can play a big role here. If \(\varepsilon\) is negative, for example, then \(x^1\) would be the mathematically optimal result, but due to the optimality tolerance, simplex might conclude that \(x^2\) is optimal. More precisely, if \(\varepsilon\) is less than the default optimality tolerance of \(10^{-6}\), then simplex is free to declare either solution optimal (within tolerances).

The above statement is true whenever the distance between \(x^1\) and \(x^2\) is not too large. To see this, consider what happens when we change the right-hand side of \(A_{4\cdot}\) from 1 to \(10^6\). The feasible region would then be a very long rectangular box, with vertices \((0,0)\), \((0,1)\), \((10^6,1)\) and \((10^6,0)\). Perhaps somewhat surprisingly, if \(\varepsilon\) is below the dual tolerance, simplex may consider \((10^6,1)\) optimal, even though its objective value is \(1-10^6\varepsilon\), which can be very relevant in terms of the final objective value.
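These two situations, a short edge and a long edge nearly parallel to the objective, can be made concrete with a back-of-the-envelope computation (the tolerance value is the one quoted above; the rest is illustrative):

```python
tol = 1e-6   # default optimality tolerance mentioned above
eps = 1e-7   # objective tilt on x, deliberately below the tolerance

def obj(x, y):
    return eps * x + y

# short edge: x ranges over [0, 1], so the two candidate corners nearly tie
gap_short = abs(obj(1.0, 1.0) - obj(0.0, 1.0))   # = eps, well below tol
# long edge: x ranges over [0, 1e6], and the tie is no longer harmless
gap_long = abs(obj(1e6, 1.0) - obj(0.0, 1.0))    # = 0.1, very visible
```

With the short edge, declaring either corner optimal changes the objective by \(10^{-7}\); with the long edge, the same tolerance decision changes it by \(0.1\).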

Note that both situations share one ingredient: The objective function is (almost) parallel to one of the sides of the feasible region. In the first case, this side is relatively short, and thus jumping from \(x^2\) to \(x^1\) translates into a small change in objective value. In the second case, the side almost parallel to the objective function is very long, and now the jump from \(x^2\) to \(x^1\) can have a significant impact on the final objective function.

If you take out either of these two ingredients, namely the objective vector being almost parallel to a constraint, or the edge induced by this nearly-parallel constraint being very long, then this problem can not arise. For the reasons discussed at the beginning of this section, it is common for the objective function to be close to parallel to one or more constraints. Thus, the best way to avoid this situation is to avoid the second condition. The simplest way to do this is to ensure that the ranges for your variables are not too large. Please refer to the Tolerances and User-Scaling section for guidance on this.

Thin Feasible Regions#

We now consider another extreme situation that can lead to unexpected results. Consider the problem defined as

\[\begin{split}\begin{array}{lrrl} \max & y & \vec{c}=&(0,1)\\ s.t. & -x + \varepsilon y \leq 1 & A_{1\cdot}=&(-1,\varepsilon)\\ & x + \varepsilon y \leq 1 & A_{2\cdot}=&(1,\varepsilon)\\ & -y \leq 0 & A_{3\cdot}=&(0,-1)\\ \end{array}\end{split}\]

and its graphical representation

../../_images/codedraw7.svg
../../_images/codedraw7-dark.svg

It is clear from the graphical representation that the optimal solution for the problem will be at the intersection of constraints \(A_{1\cdot}\) with \(A_{2\cdot}\); and if we do the algebra, we will get that \(x^*=(0,\frac{1}{\varepsilon})\). Also note that as you decrease \(\varepsilon\) the feasible region stretches upwards, leaving its base unchanged. We will consider the case where \(\varepsilon\) is a very small, positive number (between \(10^{-9}\) and \(10^{-6}\)).

If we perturb the right-hand side vector \(b\) from \((1,1)\) to \((1+\delta,1)\), the new solution will be \(\tilde{x}^*=(-\frac{\delta}{2},\frac{2+\delta}{2\varepsilon})\). To assess the impact of this perturbation, we compute the \(L_1\) distance between this modified solution and the previous solution, which is given by

\[\|x^*-\tilde{x}^*\|_1 = \frac{|\delta|}{2}+\frac{|\delta|}{2\varepsilon}.\]

This quantity can be either small or very large, depending on the relative magnitude between \(\delta\) and \(\varepsilon\). If \(\delta\) is much smaller than \(\varepsilon\), then this quantity will be small. However, if \(\delta\) is larger than or even the same order of magnitude as \(\varepsilon\), the opposite will be true. Very small perturbations in the input data can lead to big changes in the optimal solution.

A similar issue arises if we perturb \(A_{1\cdot}\) to \((-1,\delta)\); the new optimal solution becomes \(\tilde{x}^*=(1-\frac{2\varepsilon}{\varepsilon+\delta},\frac{2}{\varepsilon+\delta})\). But now, if \(\delta=\varepsilon/2\), then the new solution for \(y\) will change from \(\frac{1}{\varepsilon}\) to \(\frac{4}{3\varepsilon}\) (a 33% relative difference). Again, small changes in the input can produce big changes in the optimal solution.
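The sensitivity computation above is easy to confirm numerically; here is a standalone sketch of the same algebra for the right-hand-side perturbation:

```python
def apex(eps, b1=1.0, b2=1.0):
    # intersection of -x + eps*y <= b1 and x + eps*y <= b2 (both tight):
    # adding the two tight constraints gives 2*eps*y = b1 + b2
    y = (b1 + b2) / (2.0 * eps)
    x = b2 - eps * y
    return x, y

eps, delta = 1e-6, 1e-6
x_star = apex(eps)                      # (0, 1/eps) = (0, 1e6)
x_pert = apex(eps, b1=1.0 + delta)      # (-delta/2, (2+delta)/(2*eps))

l1_dist = abs(x_star[0] - x_pert[0]) + abs(x_star[1] - x_pert[1])
# l1_dist is about 0.5000005: a 1e-6 nudge in b moved the solution
# by half a unit, an amplification of roughly 1/(2*eps)
```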

What is driving this bad behavior? The problem is that the optimal point is defined by two constraints that are nearly parallel. The smaller \(\varepsilon\) is, the closer to parallel they are. When the constraints are this close to parallel, small changes in the slopes can lead to big movements in the point where they intersect. Mathematically speaking:

\[\lim_{\varepsilon\rightarrow0^+}\|x^*\|=\infty\]

Note however that, if the original problem had included an additional variable bound of the form \(y \leq 10^4\), then neither of these bad behaviors would have been possible. For any \(\varepsilon\) value smaller than \(10^{-4}\), the optimal point would be defined by the new constraint and one of the constraints \(A_{2\cdot}\) or \(A_{1\cdot}\), which would again lead to a well-behaved (i.e. stable) solution. In summary, this sort of issue can only arise when the feasible region is either unbounded or very large. See the Tolerances and User-Scaling section for further guidance on bounding the feasible region.

Optimizing Over the Circle#

Now we provide our first thought experiment: Consider the problem of optimizing a linear function over the feasible region defined by the constraints

\[\sin(2\pi\frac{i}{10^6}) x + \cos(2\pi\frac{i}{10^6}) y \leq 1,\,\forall i\in\{1,\ldots,10^6\},\]

i.e. the feasible region is essentially a unit circle in \(\mathbb{R}^2\). Note that for all objective functions, the corresponding optimal point will be defined by two linear constraints that are very close to parallel. What will happen to the numerical solution to the problem? Can you guess? The situation is depicted in the figure below:

../../_images/codedraw1.svg
../../_images/codedraw1-dark.svg

To perform the experiment, we execute the code circleOpt.py, where we randomly select an objective vector, find the optimal solution to the resulting optimization problem, and compute several relevant quantities:

  • The worst distance between the reported primal solution, and the theoretical solution to the problem of actually optimizing over a perfect circle, over all previous runs.

  • The worst bound violation reported by Gurobi over all previous runs.

  • The worst constraint violation reported by Gurobi over all previous runs.

  • The worst dual violation reported by Gurobi over all previous runs.

  • The number of previous experiments.

  • Accumulated number of simplex iterations.

  • The \(\kappa\) (KappaExact attribute) value for the current optimal basis.

Sample output is shown below:

Added 2 Vars and 1048576 constraints in 19.19 seconds
Errors: 8.65535e-08 0 2.94137e-07 2.77556e-17 Iter 0 10 Kappa 3150.06
Errors: 4.81978e-07 0 3.22359e-07 2.77556e-17 Iter 1 21 Kappa 3009.12
Errors: 4.81978e-07 0 3.4936e-07 1.11022e-16 Iter 2 33 Kappa 2890.58
Errors: 1.53201e-06 0 9.78818e-07 1.11022e-16 Iter 6 79 Kappa 1727.89
Errors: 1.61065e-06 0 8.26005e-07 1.11022e-16 Iter 46 536 Kappa 1880.73
Errors: 1.61065e-06 0 8.84782e-07 1.11022e-16 Iter 52 602 Kappa 1817.27
Errors: 1.61065e-06 0 9.4557e-07 1.11022e-16 Iter 54 625 Kappa 1757.96
Errors: 1.69167e-06 0 9.78818e-07 1.11022e-16 Iter 64 742 Kappa 1727.89
Errors: 1.69167e-06 0 3.8268e-07 1.66533e-16 Iter 88 1022 Kappa 2761.99
Errors: 1.69167e-06 0 9.04817e-07 1.66533e-16 Iter 92 1067 Kappa 1797.06
Errors: 1.69167e-06 0 2.94137e-07 2.22045e-16 Iter 94 1089 Kappa 3150.06
Errors: 1.69167e-06 0 3.29612e-07 2.22045e-16 Iter 95 1101 Kappa 2975.84
Errors: 1.69167e-06 0 3.4936e-07 2.22045e-16 Iter 98 1137 Kappa 2890.58
Errors: 1.69167e-06 0 9.25086e-07 2.22045e-16 Iter 99 1147 Kappa 1777.3
Errors: 1.69167e-06 0 9.78818e-07 2.22045e-16 Iter 107 1237 Kappa 1727.89
Errors: 1.69167e-06 0 9.99895e-07 2.22045e-16 Iter 112 1293 Kappa 1709.61
Errors: 1.84851e-06 0 9.78818e-07 2.22045e-16 Iter 132 1523 Kappa 1727.89
Errors: 1.96603e-06 0 9.99895e-07 2.22045e-16 Iter 134 1545 Kappa 1709.61

Surprisingly, the reported errors are rather small. Why is this? There are at least two contributing factors: first, the model has a bounded feasible region (in this case the range of both variables is \([-1,1]\)); second, the distance from one extreme point (a point at the intersection of two neighboring constraints) to its neighbors is also relatively small, so all \(\varepsilon\)-optimal solutions are close to each other.
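The second factor can be checked without any solver: each vertex of the polygon lies within about \(\pi/n\) of the unit circle, so the vertex that optimizes a random direction is necessarily close to the true optimum over the circle. The sketch below (a stand-in for the geometry, not the circleOpt.py script itself) assumes the optimal vertex is the intersection of the two constraints whose angles bracket the objective direction:

```python
import math
import random

n = 10**6                      # number of tangent constraints, as above
random.seed(7)
theta = random.uniform(0.0, 2.0 * math.pi)
c = (math.sin(theta), math.cos(theta))   # random objective direction
true_opt = c                   # over the exact circle, the optimum is c itself

def vertex(i, j):
    # intersection of constraints i and j:
    # sin(2*pi*k/n)*x + cos(2*pi*k/n)*y = 1 for k in {i, j}
    a1, b1 = math.sin(2 * math.pi * i / n), math.cos(2 * math.pi * i / n)
    a2, b2 = math.sin(2 * math.pi * j / n), math.cos(2 * math.pi * j / n)
    det = a1 * b2 - b1 * a2
    return ((b2 - b1) / det, (a1 - a2) / det)

# the optimal vertex sits between the two constraints bracketing theta
i = math.floor(theta * n / (2.0 * math.pi))
v = vertex(i, i + 1)
dist = math.hypot(v[0] - true_opt[0], v[1] - true_opt[1])
# dist is on the order of 1e-6 or less: every vertex hugs the circle,
# so epsilon-optimal solutions cannot drift far from the true optimum
```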

We encourage you to play with this code, perturb some of the input data, and analyze the results. You will see that the discrepancies between the theoretical and the numerical optimal solution are comparable to the sizes of the perturbations.

Optimizing Over Thin Regions#

Now we move to our second thought experiment: Consider a feasible region consisting of a triangle in \(\mathbb{R}^2\) with a very wide base and very short height, as depicted here:

../../_images/codedraw2.svg
../../_images/codedraw2-dark.svg

Assume that the ratio of the base to the height is on the order of \(10^5\), and consider a nominal objective function \(\vec{c}_1\) as in the figure.

In theory, the optimal solution should be the apex of the triangle, but assume that we randomly perturb both the right-hand side and the objective function with terms in the order of \(10^{-6}\). What will happen with the numerical solution?

To perform the experiment, we execute the code thinOpt.py, where we perform a series of re-optimizations with different perturbations as described above. To be more precise, whenever the newly computed solution is further from the mathematical solution than in any previous trial, we print:

  • The new maximum distance among solutions.

  • The current iteration.

  • The \(\kappa\) (KappaExact attribute) value for the current optimal basis.

  • The bound violation as reported by Gurobi for the current solution.

  • The constraint violation as reported by Gurobi for the current solution.

  • The dual violation as reported by Gurobi for the current solution.

Sample output is shown below:

New maxdiff 4e+16 Iter 0 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 1 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 2 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 7 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 83 Kappa 3.31072 Violations: 0 0 2.64698e-23
New maxdiff 4e+16 Iter 194 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 1073 Kappa 3.31072 Violations: 0 1.13687e-13 0
New maxdiff 4e+16 Iter 4981 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 19514 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 47117 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 429955 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 852480 Kappa 3.31072 Violations: 0 0 0

Results look very different from what we saw in our first test. The distance between the solution to the unperturbed model and the solution to the perturbed one is huge, even from the very first iteration. Also, the \(\kappa\) values are relatively small, and the reported primal, dual, and bound violations are almost zero. So, what happened? Note that when we choose \(\vec{c}_1=(0,1)\), we are choosing an optimal point where a small tilting of the objective function may move us to another extreme point very far away, hence the large distance. This is possible because the feasible region is very long and, in principle, without any bounds; i.e. this is related to the case of \(\varepsilon\)-optimal solutions and very long sides.
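The jump described above is easy to reproduce with a toy triangle (hypothetical data of the same flavor as the experiment, not thinOpt.py's actual input):

```python
# A thin triangle: base 1e5 wide, height 1 (base-to-height ratio 1e5)
verts = [(0.0, 0.0), (1e5, 0.0), (5e4, 1.0)]

def argmax_vertex(c):
    # an LP over a polytope attains its optimum at a vertex
    return max(verts, key=lambda v: c[0] * v[0] + c[1] * v[1])

apex = argmax_vertex((0.0, 1.0))      # nominal objective c1: the apex wins
tilted = argmax_vertex((1e-4, 1.0))   # tiny tilt toward +x
jump = abs(tilted[0] - apex[0])       # the optimum jumps half the base: 5e4
```

A perturbation of \(10^{-4}\) in the objective moved the optimal point by \(5\cdot 10^4\) units, even though every vertex is perfectly feasible and the basis is well conditioned, which mirrors the sample output above.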

Again, we encourage you to play with this example. For example, what would happen if the nominal objective function is \(\vec{c}_2=(1,0)\)?

Stability and Convergence#

The algorithms used to solve linear programming problems are all forced to make an assumption: that tiny changes to the system (e.g., making a small step in barrier) lead to small changes in the solution. If this is not true (due to ill-conditioning), then the algorithm may jump around in the solution space and have a hard time converging.

Finally, one way to improve the geometry of a problem is by suitably scaling variables and constraints as explained in the Tolerances and User-Scaling section, and working with bounded feasible sets with reasonable ranges for all variables.