# Instability and the Geometry of Optimization Problems

As we have seen, whenever we solve a problem numerically, we have to
accept that the input we provide and the output we obtain may differ
from the *theoretical* or *mathematical* solution to the given problem.
For example, \(0.1\), in a computer, will be represented by a number
that differs from \(0.1\) by about \(10^{-17}\). Thus, a natural
thing to worry about is whether these small differences can induce large
differences in the computed solution.

This is the idea behind the notion of the *Condition Number* for a given
problem. While it is true that for most practical optimization problems,
small perturbations in the input only induce small perturbations in the
final answer to the problem, there are some special situations where
this is not the case. These ill-behaved problems are called *Ill-Conditioned* or *Numerically Unstable*.

This section aims to show, in the context of linear optimization problems, the most common sources of this behavior, and also how to avoid it altogether. We first review the problem of solving linear systems with unique solutions, and then move to the more central issue of linear optimization problems and their geometric interpretation, describing some of the most common bad cases. We then provide two thought experiments, with interactive material, to help illustrate the concepts of this section. We conclude with some further thoughts on this topic.

Note that although the notion of the *Condition Number* has received a
lot of attention from the academic community, reviewing this literature
is beyond the scope of this document. If you want to start looking into
this topic, a good entry point can be the Condition
Number page at
Wikipedia.

## The Case of Linear Systems

Solving linear systems is a very common subroutine in any MI(QC)P solver, as many linear systems must be solved during the full execution of the algorithm.

So, consider a linear system \(Ax=b\) with a unique
solution (i.e., \(A\) is a square matrix with full rank), and suppose we
want to evaluate how the solution to the system might change if we
perturb the right-hand side \(b\). Since the system has a unique
solution, we know that given \(b\), the solution will be
\(A^{-1}b\), and if we perturb \(b\) with \(\varepsilon\),
the solution will be \(A^{-1}(b+\varepsilon)\). A measure for the
*relative* change in the solution with respect to the *relative* change
in the input would be the ratio

\[\eta(b,\varepsilon):=\frac{\|A^{-1}(b+\varepsilon)-A^{-1}b\|/\|A^{-1}b\|}{\|\varepsilon\|/\|b\|}=\frac{\|A^{-1}\varepsilon\|\,\|b\|}{\|A^{-1}b\|\,\|\varepsilon\|}.\]

Note that the above definition is independent of the magnitudes of
\(b\) and \(\varepsilon\). From there, the *worst* possible
ratio would be the result of

\[\max_{b,\varepsilon\neq 0}\;\frac{\|A^{-1}\varepsilon\|\,\|b\|}{\|A^{-1}b\|\,\|\varepsilon\|}.\]

This quantity is known as the condition number of the matrix \(A\) and is defined as

\[\kappa(A)=\|A\|\cdot\|A^{-1}\|.\]

A common interpretation of \(\kappa(A)=10^k\) is that, when solving the system \(Ax=b\), you may lose up to \(k\) digits of accuracy in \(x\) from the accuracy you have in \(b\).
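As a quick illustration of this digit loss (a NumPy sketch, not part of Gurobi; the matrix is an artificial near-singular example):

```python
import numpy as np

# An almost singular 2x2 system: the two rows are nearly identical.
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-10]])
b = np.array([2.0, 2.0])

kappa = np.linalg.cond(A)   # on the order of 1e10, so expect to lose ~10 digits
x = np.linalg.solve(A, b)   # exact solution is (2, 0)

# Perturb one entry of b by 1e-8 and observe a large change in x.
x_pert = np.linalg.solve(A, b + np.array([0.0, 1e-8]))
print(kappa, x, x_pert)
```

Here a \(10^{-8}\) change in \(b\) moves the solution from roughly \((2,0)\) to roughly \((-98,100)\), consistent with \(\kappa\approx 10^{10}\).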

The condition number for the optimal simplex basis in an LP is captured
in the KappaExact attribute. A very large \(\kappa\)
value is an indication that the result may be numerically unstable.

When this is indeed the case, the best advice is to scale the constraint matrix coefficients so that the resulting range of coefficients is small. This transformation will typically reduce the \(\kappa\) value of the final basis; please refer to the Tolerances and User-Scaling section for a discussion on how to perform this rescaling, and also for caveats on scaling in general.
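A minimal NumPy sketch of the idea, dividing each row by its largest absolute coefficient (an illustration of the effect, not the scaling Gurobi performs internally):

```python
import numpy as np

# A matrix mixing coefficients of very different magnitudes.
A = np.array([[1e6, 2e6],
              [1.0, 3.0]])

# Divide each row by its largest absolute entry to shrink the coefficient range.
row_scale = np.abs(A).max(axis=1)
A_scaled = A / row_scale[:, None]

print(np.linalg.cond(A))         # large, roughly 5e6
print(np.linalg.cond(A_scaled))  # small, on the order of 10
```

Column (variable) scaling works analogously; in either case the goal is a narrow range of coefficient magnitudes.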

## The Geometry of Linear Optimization Problems

Before showing optimization models that exhibit bad behavior, we first
need to understand the *geometry* behind them. Consider a problem of the
form

\[\begin{array}{ll}
\max & \vec{c}x\\
s.t. & Ax \leq b.\\
\end{array}\]

For example:

\[\begin{array}{lrr}
\max & x+y & \vec{c}=(1,1)\\
s.t. & -y \leq 0 & A_{1\cdot}=(0,-1)\\
     & x \leq 1 & A_{2\cdot}=(1,0)\\
     & -x \leq 0 & A_{3\cdot}=(-1,0)\\
     & y \leq 1 & A_{4\cdot}=(0,1).\\
\end{array}\]

Note that if we denote \(b^t:=(0,1,0,1)\), then the problem can be stated as

\[\max_{x\in\mathbb{R}^2}\{\vec{c}\,x : Ax\leq b\}.\]

The feasible region, the direction of improvement \(\vec{c}\), and the optimal solution \(x^*=(1,1)\) can be depicted as

Note that whenever we move in the direction of \(\vec{c}\), the
value \(\vec{c}x\) increases. Furthermore, since we cannot move
from \(x^*\) to another feasible point with better objective value,
we can conclude that \(x^*\) is indeed the optimal solution for the
problem. Note that \(x^*\) is a *corner* point of the feasible
region. This is not a coincidence; you will always find an optimal
solution at a corner point if the feasible region is bounded and
\(\vec{c}\) is not zero. If the objective is zero then all feasible
solutions are optimal; we will talk more about zero objectives and their
implications later.

To understand how changes in the input data affect the feasible region and the optimal solution, consider a small modification: \(\tilde{b}^t=(\varepsilon,1,0,1)\), \(\tilde{\vec{c}}=(1+\varepsilon,1)\), and \(\tilde{A}_{4\cdot}=(\varepsilon,1)\). Then our optimization problem would look like

\[\begin{array}{lrr}
\max & (1+\varepsilon)x+y & \tilde{\vec{c}}=(1+\varepsilon,1)\\
s.t. & -y \leq \varepsilon & \\
     & x \leq 1 & \\
     & -x \leq 0 & \\
     & \varepsilon x + y \leq 1 & \tilde{A}_{4\cdot}=(\varepsilon,1).\\
\end{array}\]

Note that although we changed the right-hand side, this change had no effect on the optimal solution to the problem, but it did change the feasible region by enlarging the bottom part of the feasible area.

Changing the objective vector tilts the corresponding vector in the
graphical representation. This of course also changes the optimal
objective value. Perturbing a constraint tilts the graphical
representation of that constraint. The change in \(A_{4\cdot}\)
changes the primal solution itself. The amount of *tilting* a constraint
undergoes depends on the relative value of the perturbation. For
example, although the constraint \(x \leq 1\) and the constraint
\(100 x \leq 100\) induce the same feasible region, the perturbation
\(x + \varepsilon y\leq 1\) will induce more tilting than the
perturbation \(100 x + \varepsilon y \leq 100\).
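The tilting remark can be checked numerically: the angle between a constraint's normal vector and its perturbed version shrinks as the coefficients grow. A small NumPy sketch (`angle_between` is a hypothetical helper written for this demo, not a Gurobi API):

```python
import numpy as np

def angle_between(u, v):
    # Angle (in radians) between two vectors.
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

eps = 1e-3
# Normal of x <= 1 is (1, 0); perturbing to x + eps*y <= 1 gives (1, eps).
tilt_small = angle_between(np.array([1.0, 0.0]), np.array([1.0, eps]))
# Normal of 100x <= 100 is (100, 0); perturbing to 100x + eps*y <= 100 gives (100, eps).
tilt_large = angle_between(np.array([100.0, 0.0]), np.array([100.0, eps]))

print(tilt_small, tilt_large)  # the second tilt is about 100 times smaller
```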

## Multiple Optimal Solutions

A common misconception among beginners in optimization is the idea that
optimization problems really have just one solution. Surprisingly, this
is typically not true. For many practical problems, the objective
(whether it is cost or revenue or …) is dominated by a handful of
variables, while most variables are just there to ensure that the actual
*operation* of the solution is possible. Consider a staffing problem,
for example, where cost is typically driven by the number of people who
work on a given day, not by the specific people.

This kind of situation naturally leads to problems similar to

\[\begin{array}{lrr}
\max & y & \vec{c}=(0,1)\\
s.t. & -y \leq 0\\
     & x \leq 1\\
     & -x \leq 0\\
     & y \leq 1.\\
\end{array}\]

Graphically this can be depicted as

In this situation it is clear that \(x^1\), \(x^3\), and all solutions lying on the line between these two points are optimal. The simplex algorithm will return either \(x^1\) or \(x^3\) (and may switch if you change parameters). The barrier algorithm (without crossover) will return \(x^2\), a point in the middle of the segment joining them. These solutions are all correct; the problem as stated has no reason to prefer one over the other. If you do have a preference, you’ll need to state it in your objective function.
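A tiny pure-Python check of this situation, assuming (as in this example) that the objective simply maximizes \(y\) over the unit square, so every point on the top edge is optimal:

```python
# Corner points of the unit-square feasible region and the objective c = (0, 1).
vertices = [(0, 0), (1, 0), (1, 1), (0, 1)]
c = (0, 1)

values = [c[0] * x + c[1] * y for x, y in vertices]
best = max(values)
optimal = [v for v, val in zip(vertices, values) if val == best]

print(optimal)  # [(1, 1), (0, 1)]: two optimal corners, so the whole top edge is optimal
```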

## Dealing with Epsilon-Optimal Solutions

The previous section considered the case of multiple (true) optimal solutions. What happens when we have several \(\varepsilon\)-optimal solutions? To be more specific, consider

\[\begin{array}{lrr}
\max & \varepsilon x + y & \vec{c}=(\varepsilon,1)\\
s.t. & -y \leq 0\\
     & y \leq 1\\
     & -x \leq 0\\
     & x \leq 1 & A_{4\cdot}=(1,0).\\
\end{array}\]

Here \(x^1=(0,1)\) and \(x^2=(1,1)\) denote the two top corners of the feasible region.
Graphically this can be depicted as

If \(\varepsilon\) is zero, then we are in the situation described
before. Note, however, that a small perturbation of the objective vector
may lead to either \(x^1\) or \(x^2\) being reported as optimal.
And tolerances can play a big role here. If \(\varepsilon\) is
negative, for example, then \(x^1\) would be the mathematically
optimal result, but due to the *optimality tolerance*, simplex might
conclude that \(x^2\) is optimal. More precisely, if
\(\varepsilon\) is less than the default optimality tolerance of
\(10^{-6}\), then simplex is free to declare either solution optimal
(within tolerances).

The above statement is true whenever the *distance* between \(x^1\)
and \(x^2\) is not too large. To see this, consider what happens
when we change the right-hand side of \(A_{4\cdot}\) from 1 to
\(10^6\). The feasible region would then be a very long
rectangular box, with vertices \((0,0)\), \((0,1)\),
\((10^6,1)\) and \((10^6,0)\). Perhaps somewhat surprisingly, if
\(\varepsilon\) is below the dual tolerance, simplex may consider
\((10^6,1)\) optimal, even though its objective value is
\(1-10^6\varepsilon\), which can be very relevant in terms of the
final objective value.
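The size of the jump is just arithmetic, as a pure-Python sketch shows (the `objective` helper and the vertex choices mirror this example, with \(\varepsilon=10^{-7}\), below the \(10^{-6}\) tolerance):

```python
eps = 1e-7  # below the default optimality tolerance of 1e-6

def objective(x, y, eps):
    # Objective eps*x + y from the example above.
    return eps * x + y

# Short edge: unit box, candidate vertices (0, 1) and (1, 1).
gap_short = objective(1, 1, eps) - objective(0, 1, eps)

# Long edge: box stretched to x <= 1e6, candidate vertices (0, 1) and (1e6, 1).
gap_long = objective(1e6, 1, eps) - objective(0, 1, eps)

print(gap_short, gap_long)  # roughly 1e-7 versus 0.1
```

The objective gap between the two candidate vertices is \(\varepsilon\) times the length of the nearly parallel edge, so the same sub-tolerance \(\varepsilon\) is harmless on the short edge and costly on the long one.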

Note that both situations share one ingredient: the objective function is (almost) parallel to one of the sides of the feasible region. In the first case, this side is relatively short, and thus jumping from \(x^2\) to \(x^1\) translates into a small change in objective value. In the second case, the side almost parallel to the objective function is very long, and now the jump from \(x^2\) to \(x^1\) can have a significant impact on the final objective value.

If you take out either of these two ingredients, namely the objective
vector being almost parallel to a constraint, or the *edge* induced by
this nearly-parallel constraint being very long, then this problem
cannot arise. For the reasons discussed at the beginning of this section,
it is common for the objective function to be close to parallel to one
or more constraints. Thus, the best way to avoid this situation is to
avoid the second condition. The simplest way to do this is to ensure
that the ranges for your variables are not too large. Please refer to
the Tolerances and User-Scaling
section for guidance on this.

## Thin Feasible Regions

We now consider another extreme situation that can lead to unexpected results. Consider the problem defined as

\[\begin{array}{lrr}
\max & y & \\
s.t. & -x + \varepsilon y \leq 1 & A_{1\cdot}=(-1,\varepsilon)\\
     & x + \varepsilon y \leq 1 & A_{2\cdot}=(1,\varepsilon)\\
     & -y \leq 0. & \\
\end{array}\]

and its graphical representation

It is clear from the graphical representation that the optimal solution for the problem will be at the intersection of constraints \(A_{1\cdot}\) with \(A_{2\cdot}\); and if we do the algebra, we will get that \(x^*=(0,\frac{1}{\varepsilon})\). Also note that as you decrease \(\varepsilon\) the feasible region stretches upwards, leaving its base unchanged. We will consider the case where \(\varepsilon\) is a very small, positive number (between \(10^{-9}\) and \(10^{-6}\)).

If we perturb the right-hand side vector \(b\) from \((1,1)\) to \((1+\delta,1)\), the new solution will be \(\tilde{x}^*=(-\frac{\delta}{2},\frac{2+\delta}{2\varepsilon})\). To assess the impact of this perturbation, we compute the \(L_1\) distance between this modified solution and the previous solution, which is given by

\[\|x^*-\tilde{x}^*\|_1 = \frac{\delta}{2}+\frac{\delta}{2\varepsilon}.\]

This quantity can be either small or very large, depending on the relative magnitude between \(\delta\) and \(\varepsilon\). If \(\delta\) is much smaller than \(\varepsilon\), then this quantity will be small. However, if \(\delta\) is larger than or even the same order of magnitude as \(\varepsilon\), the opposite will be true. Very small perturbations in the input data can lead to big changes in the optimal solution.

A similar issue arises if we perturb \(A_{1\cdot}\) to \((-1,\delta)\); the new optimal solution becomes \(\tilde{x}^*=(1-\frac{2\varepsilon}{\varepsilon+\delta},\frac{2}{\varepsilon+\delta})\). But now, if \(\delta=\varepsilon/2\), then the new solution for \(y\) will change from \(\frac{1}{\varepsilon}\) to \(\frac{4}{3\varepsilon}\) (a 33% relative difference). Again, small changes in the input can produce big changes in the optimal solution.

What is driving this bad behavior? The problem is that the optimal point is defined by two constraints that are nearly parallel. The smaller \(\varepsilon\) is, the closer to parallel they are. When the constraints are this close to parallel, small changes in the slopes can lead to big movements in the point where they intersect. Mathematically speaking, for any fixed \(\delta>0\),

\[\lim_{\varepsilon\rightarrow 0^+}\|x^*-\tilde{x}^*\|_1=\lim_{\varepsilon\rightarrow 0^+}\frac{\delta}{2}\left(1+\frac{1}{\varepsilon}\right)=\infty.\]
Note, however, that if the original problem had an additional variable bound of the form \(y \leq 10^4\), then neither of these bad behaviors would have been possible. For any \(\varepsilon\) value smaller than \(10^{-4}\), the optimal point would be defined by the new constraint and one of the constraints \(A_{2\cdot}\) or \(A_{1\cdot}\), which would again lead to a well-behaved (i.e., stable) solution. In summary, this sort of issue can arise only when the feasible region is either unbounded or very large. See the Tolerances and User-Scaling section for further guidance on bounding the feasible region.
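The amplification is easy to reproduce with plain linear algebra (a NumPy sketch with no solver involved; the values of `eps` and `delta` are illustrative choices): intersect the two nearly parallel constraints, then perturb the first right-hand side.

```python
import numpy as np

eps = 1e-6    # near-parallel slope from the example
delta = 1e-4  # perturbation of the first right-hand side

A = np.array([[-1.0, eps],
              [ 1.0, eps]])
x_star = np.linalg.solve(A, np.array([1.0, 1.0]))          # (0, 1/eps)
x_pert = np.linalg.solve(A, np.array([1.0 + delta, 1.0]))  # (-delta/2, (2+delta)/(2*eps))

dist = np.abs(x_star - x_pert).sum()  # L1 distance, delta/2 * (1 + 1/eps), about 50
print(dist)
```

A perturbation of \(10^{-4}\) in the data moves the optimal point by about 50 units, an amplification factor of roughly \(1/(2\varepsilon)\).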

## Optimizing Over the Circle

Now we provide our first thought experiment: Consider the problem of optimizing a linear function over the feasible region defined by the constraints

\[\sin\left(\frac{2\pi i}{2^{20}}\right) x + \cos\left(\frac{2\pi i}{2^{20}}\right) y \leq 1\quad \forall i\in\{1,\ldots,2^{20}\},\]

i.e., the feasible region is essentially a unit circle in \(\mathbb{R}^2\). Note that for all objective functions, the corresponding optimal point will be defined by two linear constraints that are very close to parallel. What will happen to the numerical solution to the problem? Can you guess? The situation is depicted in the figure below:

To perform the experiment, we execute the code circleOpt.py, where we randomly select an objective vector, find the optimal solution to the resulting optimization problem, and compute several relevant quantities:

- The worst *distance* between the reported primal solution and the theoretical solution (i.e., the result of actually optimizing over a perfect circle), over all previous runs.
- The worst bound violation reported by Gurobi over all previous runs.
- The worst constraint violation reported by Gurobi over all previous runs.
- The worst dual violation reported by Gurobi over all previous runs.
- The number of previous experiments.
- The accumulated number of simplex iterations.
- The \(\kappa\) (KappaExact attribute) value for the current optimal basis.

Sample output is shown below:

```
Added 2 Vars and 1048576 constraints in 19.19 seconds
Errors: 8.65535e-08 0 2.94137e-07 2.77556e-17 Iter 0 10 Kappa 3150.06
Errors: 4.81978e-07 0 3.22359e-07 2.77556e-17 Iter 1 21 Kappa 3009.12
Errors: 4.81978e-07 0 3.4936e-07 1.11022e-16 Iter 2 33 Kappa 2890.58
Errors: 1.53201e-06 0 9.78818e-07 1.11022e-16 Iter 6 79 Kappa 1727.89
Errors: 1.61065e-06 0 8.26005e-07 1.11022e-16 Iter 46 536 Kappa 1880.73
Errors: 1.61065e-06 0 8.84782e-07 1.11022e-16 Iter 52 602 Kappa 1817.27
Errors: 1.61065e-06 0 9.4557e-07 1.11022e-16 Iter 54 625 Kappa 1757.96
Errors: 1.69167e-06 0 9.78818e-07 1.11022e-16 Iter 64 742 Kappa 1727.89
Errors: 1.69167e-06 0 3.8268e-07 1.66533e-16 Iter 88 1022 Kappa 2761.99
Errors: 1.69167e-06 0 9.04817e-07 1.66533e-16 Iter 92 1067 Kappa 1797.06
Errors: 1.69167e-06 0 2.94137e-07 2.22045e-16 Iter 94 1089 Kappa 3150.06
Errors: 1.69167e-06 0 3.29612e-07 2.22045e-16 Iter 95 1101 Kappa 2975.84
Errors: 1.69167e-06 0 3.4936e-07 2.22045e-16 Iter 98 1137 Kappa 2890.58
Errors: 1.69167e-06 0 9.25086e-07 2.22045e-16 Iter 99 1147 Kappa 1777.3
Errors: 1.69167e-06 0 9.78818e-07 2.22045e-16 Iter 107 1237 Kappa 1727.89
Errors: 1.69167e-06 0 9.99895e-07 2.22045e-16 Iter 112 1293 Kappa 1709.61
Errors: 1.84851e-06 0 9.78818e-07 2.22045e-16 Iter 132 1523 Kappa 1727.89
Errors: 1.96603e-06 0 9.99895e-07 2.22045e-16 Iter 134 1545 Kappa 1709.61
```

Surprisingly, the reported errors are rather small. Why is this? There are at least two contributing factors: first, the model has a bounded feasible region (in this case the range of both variables is \([-1,1]\)). In addition, the distance from one extreme point (a point at the intersection of two neighboring constraints) to its neighbors is also relatively small, so all \(\varepsilon\)-optimal solutions are close to one another.

We encourage you to play with this code, perturb some of the input data, and analyze the results. You will see that the discrepancies between the theoretical and the numerical optimal solutions are comparable to the sizes of the perturbations.
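If you want to experiment without Gurobi, a rough stand-in using SciPy's `linprog` looks like this (a hypothetical re-creation with \(2^{10}\) rather than \(2^{20}\) constraints, and without the KappaExact reporting):

```python
import numpy as np
from scipy.optimize import linprog

n = 2**10  # far fewer constraints than the original experiment, for speed
theta = 2 * np.pi * np.arange(1, n + 1) / n
A = np.column_stack([np.sin(theta), np.cos(theta)])  # A x <= 1 approximates the unit circle
b = np.ones(n)

rng = np.random.default_rng(0)
c = rng.normal(size=2)  # random objective vector

# Maximize c.x over the polygonal approximation (linprog minimizes, hence -c).
res = linprog(-c, A_ub=A, b_ub=b, bounds=[(None, None), (None, None)])

x_theory = c / np.linalg.norm(c)  # exact optimum over the true circle
dist = np.linalg.norm(res.x - x_theory)
print(res.status, dist)  # distance shrinks as n grows
```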

## Optimizing Over Thin Regions

Now we move to our second thought experiment: Consider a feasible region consisting of a triangle in \(\mathbb{R}^2\) with a very wide base and very short height, as depicted here:

Consider the case where the ratio of the base to the height is on the
order of \(10^5\), and that we consider a *nominal* objective
function \(\vec{c}_1\) as in the figure.

In theory, the optimal solution should be the apex of the triangle, but assume that we randomly perturb both the right-hand side and the objective function with terms on the order of \(10^{-6}\). What will happen to the numerical solution?

To perform the experiment, we execute the code thinOpt.py, where we perform a series of re-optimizations with different perturbations as described above. To be more precise, whenever the newly computed solution is farther from the mathematical solution than in all previous trials, we print:

- The new maximum distance among solutions.
- The current iteration.
- The \(\kappa\) (KappaExact attribute) value for the current optimal basis.
- The bound violation as reported by Gurobi for the current solution.
- The constraint violation as reported by Gurobi for the current solution.
- The dual violation as reported by Gurobi for the current solution.

Sample output is shown below:

```
New maxdiff 4e+16 Iter 0 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 1 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 2 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 7 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 83 Kappa 3.31072 Violations: 0 0 2.64698e-23
New maxdiff 4e+16 Iter 194 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 1073 Kappa 3.31072 Violations: 0 1.13687e-13 0
New maxdiff 4e+16 Iter 4981 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 19514 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 47117 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 429955 Kappa 3.31072 Violations: 0 0 0
New maxdiff 4e+16 Iter 852480 Kappa 3.31072 Violations: 0 0 0
```

Results look very different from what we saw in our first test. The distance between the solution to the unperturbed model and the solution to the perturbed one is huge, even from the very first iteration. Also, the \(\kappa\) values are relatively small, and the reported primal, dual, and bound violations are almost zero. So, what happened? Note that when we choose \(\vec{c}_1=(0,1)\), we are choosing an optimal point where a small tilting of the objective function may move us to another extreme point very far away, hence the large distance between solutions. This is possible because the region is very large and, in principle, without any bounds; i.e., this is related to the case of \(\varepsilon\)-optimal solutions and very long sides.

Again, we encourage you to play with this example. For example, what would happen if the nominal objective function is \(\vec{c}_2=(1,0)\)?
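As a rough stand-in for thinOpt.py, the following SciPy sketch uses an illustrative geometry (a triangle with base \(10^5\), height 1, and apex at \((5\cdot 10^4, 1)\); the actual model in thinOpt.py may differ) to show how a small tilt of the nominal objective \(\vec{c}_1=(0,1)\) moves the optimum from the apex to a far-away base vertex. For this base-to-height ratio, a tilt of \(10^{-4}\) is already enough:

```python
from scipy.optimize import linprog

# Thin triangle with vertices (0, 0), (1e5, 0), and apex (5e4, 1):
#   left edge:  -x + 5e4*y <= 0
#   right edge:  x + 5e4*y <= 1e5
#   base:        y >= 0
A_ub = [[-1.0, 5e4], [1.0, 5e4]]
b_ub = [0.0, 1e5]
bounds = [(None, None), (0, None)]

# Nominal objective c1 = (0, 1): maximize y, optimum at the apex (5e4, 1).
apex = linprog([0.0, -1.0], A_ub=A_ub, b_ub=b_ub, bounds=bounds).x

# Tilt the objective slightly to (1e-4, 1): the optimum jumps to (1e5, 0).
tilted = linprog([-1e-4, -1.0], A_ub=A_ub, b_ub=b_ub, bounds=bounds).x

print(apex, tilted)  # the 1e-4 tilt moves the solution by about 5e4
```

The threshold tilt is roughly twice the height-to-base ratio; the more extreme that ratio, the smaller the perturbation needed to trigger the jump.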

## Stability and Convergence

The algorithms used to solve linear programming problems are all forced to make an assumption: that tiny changes to the system (e.g., making a small step in barrier) lead to small changes in the solution. If this is not true (due to ill-conditioning), then the algorithm may jump around in the solution space and have a hard time converging.

Finally, one way to improve the geometry of a problem is to suitably
scale variables and constraints, as explained in the
Tolerances and User-Scaling
section, and to work with bounded feasible
sets with *reasonable* ranges for all variables.