
Special Algorithms for Convex Optimization
We describe several simple but powerful algorithms that are specifically designed for convex optimization. These algorithms require only the ability to compute the function value and any subgradient for each relevant function. A key feature of these algorithms is that they maintain converging upper and lower bounds on the quantity being computed, and thus can compute the quantity to a guaranteed accuracy. We demonstrate each on the two-parameter example of chapter 11; in chapter 15 they are applied to more substantial problems.
In this chapter we concentrate on the finite-dimensional case; these methods are extended to the infinite-dimensional case in the next chapter.
14.1 Notation and Problem Definitions
We will consider several specific forms of optimization problems.
The unconstrained problem is to compute the minimum value of $\phi$,
\[
\phi^* = \min_z \phi(z), \tag{14.1}
\]
and in addition to compute a minimizer $x^*$, which satisfies $\phi(x^*) = \phi^*$. We will use the notation
\[
x^* = \operatorname*{argmin}_z \phi(z)
\]
to mean that $x^*$ is some minimizer of $\phi$.
The constrained optimization problem is to compute the minimum value of $\phi$, subject to the constraints $\phi_1(z) \le 0, \dots, \phi_m(z) \le 0$, i.e.,
\[
\phi^* = \min_{\phi_1(z) \le 0, \dots, \phi_m(z) \le 0} \phi(z),
\]
CHAPTER 14 SPECIAL ALGORITHMS FOR CONVEX OPTIMIZATION
and in addition to compute a minimizer $x^*$ that satisfies $\phi(x^*) = \phi^*$, $\phi_1(x^*) \le 0, \dots, \phi_m(x^*) \le 0$. To simplify notation we define the constraint function
\[
\psi(z) = \max_{1 \le i \le m} \phi_i(z),
\]
so we can express the constraints as $\psi(z) \le 0$:
\[
\phi^* = \min_{\psi(z) \le 0} \phi(z). \tag{14.2}
\]
We will say that $z$ is feasible if $\psi(z) \le 0$.
The feasibility problem is:
\[
\text{find } x \text{ such that } \psi(x) \le 0, \text{ or determine that there is no such } x. \tag{14.3}
\]
Algorithms that are designed for one form of the convex optimization problem can often be modified or adapted for others; we will see several examples of this. It is also possible to modify the algorithms we present to directly solve other problems that we do not consider, e.g., to find Pareto optimal points or to do goal programming.
Throughout this chapter, unless otherwise stated, $\phi$ and $\phi_1, \dots, \phi_m$ are convex functions from $\mathbf{R}^n$ into $\mathbf{R}$. The constraint function $\psi$ is then also convex.
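As an aside, the constraint function and one of its subgradients are easy to evaluate numerically: a subgradient of a pointwise maximum of convex functions at a point is any subgradient of a term that achieves the maximum there. A minimal Python sketch; the two constraint functions below are hypothetical examples, not taken from the text:

```python
import numpy as np

# Hypothetical convex constraint functions phi_i and their gradients
# (smooth here, so the gradient is the unique subgradient).
constraints = [
    (lambda z: z[0]**2 + z[1]**2 - 1.0, lambda z: np.array([2*z[0], 2*z[1]])),
    (lambda z: z[0] + z[1] - 0.5,       lambda z: np.array([1.0, 1.0])),
]

def psi_and_subgrad(z):
    """Evaluate psi(z) = max_i phi_i(z) and one subgradient of psi at z."""
    vals = [f(z) for f, _ in constraints]
    i = int(np.argmax(vals))            # an index achieving the maximum
    return vals[i], constraints[i][1](z)

val, h = psi_and_subgrad(np.array([1.0, 1.0]))
# phi_1 = 1.0 and phi_2 = 1.5 here, so psi = 1.5 with subgradient (1, 1)
```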
14.2 On Algorithms for Convex Optimization
Many algorithms for convex optimization have been devised, and we could not hope
to survey them here. Instead, we give a more detailed description of two types of
algorithms that are specifically designed for convex problems: cutting-plane and
ellipsoid algorithms.
Let us briefly mention another large family of algorithms, the descent methods, contrasting them with the algorithms that we will describe. In these methods, successive iterations produce points that have decreasing objective values. General-purpose descent methods have been successfully applied to, and adapted for, nondifferentiable convex optimization; see the Notes and References at the end of this chapter. The cutting-plane and ellipsoid algorithms are not descent methods; objective values often increase after an iteration.
Possible advantages and disadvantages of some of the descent methods over
cutting-plane and ellipsoid methods are:
- The cutting-plane and ellipsoid algorithms require only the evaluation of function values and any one (of possibly many) subgradients of functions. Most (but not all) descent methods require the computation of a descent direction or even steepest descent direction for the function at a point; this can be a difficult task in itself, and is always at least as difficult as computing a subgradient.




- Many (but not all) descent methods for convex optimization retain the heuristic stopping criteria used when they are applied to general (nonconvex) optimization problems. In contrast we will see that the cutting-plane and ellipsoid methods have simple stopping criteria that guarantee the optimum has been found to a known accuracy.
- Most descent methods for nondifferentiable optimization are substantially more complicated than the cutting-plane and ellipsoid algorithms. These methods often include parameters that need to be adjusted for the particular problem.
- For smooth problems, many of the descent methods offer substantially faster convergence, e.g., quadratic.
Let us immediately qualify the foregoing remarks. Many descent methods have
worked very well in practice: one famous example is the simplex method for linear
programming. In addition, it is neither possible nor profitable to draw a sharp
line between descent methods and non-descent methods. For example, the ellipsoid
algorithm can be interpreted as a type of variable-metric descent algorithm. We
refer the reader to the references at the end of this chapter.
14.3 Cutting-Plane Algorithms
14.3.1 Computing a Lower Bound on $\phi^*$
We consider the unconstrained problem (14.1). Suppose we have computed function values and at least one subgradient at $x_1, \dots, x_k$:
\[
\phi(x_1), \dots, \phi(x_k), \qquad g_1 \in \partial\phi(x_1), \dots, g_k \in \partial\phi(x_k). \tag{14.4}
\]
Each of these function values and subgradients yields an affine lower bound on $\phi$:
\[
\phi(z) \ge \phi(x_i) + g_i^T (z - x_i) \quad \text{for all } z, \qquad 1 \le i \le k,
\]
and hence
\[
\phi(z) \ge \mathrm{lb}_k(z) = \max_{1 \le i \le k} \left( \phi(x_i) + g_i^T (z - x_i) \right). \tag{14.5}
\]
$\mathrm{lb}_k$ is a piecewise linear convex function that is everywhere less than or equal to $\phi$. Moreover, this global lower bound function is tight at the points $x_1, \dots, x_k$, since $\phi(x_i) = \mathrm{lb}_k(x_i)$ for $1 \le i \le k$. In fact, $\mathrm{lb}_k$ is the smallest convex function that has the function values and subgradients given in (14.4). An example is shown in figure 14.1.
It follows that
\[
\phi^* \ge L_k = \min_z \mathrm{lb}_k(z). \tag{14.6}
\]
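For concreteness, the lower bound function (14.5) can be evaluated directly from the stored points, function values, and subgradients. A minimal Python sketch, using the hypothetical one-dimensional objective $\phi(z) = |z|$:

```python
import numpy as np

def lb(z, xs, fvals, gs):
    """Evaluate lb_k(z) = max_i ( phi(x_i) + g_i^T (z - x_i) )."""
    return max(f + g @ (z - x) for x, f, g in zip(xs, fvals, gs))

# Example: phi(z) = |z| in one dimension, sampled at x = -1 and x = 2.
xs    = [np.array([-1.0]), np.array([2.0])]
fvals = [1.0, 2.0]                               # phi(-1), phi(2)
gs    = [np.array([-1.0]), np.array([1.0])]      # subgradients of |z|
val = lb(np.array([0.0]), xs, fvals, gs)
# lb_2(0) = max(1 - 1, 2 - 2) = 0 = phi(0): the bound happens to be tight here
```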




[Figure 14.1: The maximum of the affine support functionals at $x_1$, $x_2$, and $x_3$ gives the global lower bound function $\mathrm{lb}_3$, which is tight at the points $x_1$, $x_2$, and $x_3$. $L_3$, the minimum of $\mathrm{lb}_3$, which occurs at $\tilde{x}_3$, is found by solving (14.7). $L_3$ is a lower bound on $\phi^*$.]
The minimization problem on the right-hand side is readily solved via linear programming. We can express (14.6) as
\[
L_k = \min_{z,\,L} \left\{ L \;\middle|\; \phi(x_i) + g_i^T (z - x_i) \le L, \; 1 \le i \le k \right\},
\]
which has the form of a linear program in the variable $w$:
\[
L_k = \min \left\{ c^T w \;\middle|\; A w \le b \right\}, \tag{14.7}
\]
where
\[
w = \begin{bmatrix} z \\ L \end{bmatrix}, \qquad
c = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad
A = \begin{bmatrix} g_1^T & -1 \\ \vdots & \vdots \\ g_k^T & -1 \end{bmatrix}, \qquad
b = \begin{bmatrix} g_1^T x_1 - \phi(x_1) \\ \vdots \\ g_k^T x_k - \phi(x_k) \end{bmatrix}. \tag{14.8}
\]
In fact, this linear program determines not only $L_k$, but also a minimizer of $\mathrm{lb}_k$, which we denote $\tilde{x}_k$ (shown in figure 14.1 as well).
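The linear program (14.7)-(14.8) maps directly onto a standard LP solver. A sketch using `scipy.optimize.linprog`, continuing the hypothetical $\phi(z) = |z|$ data, with box bounds on $z$ added so the LP is bounded:

```python
import numpy as np
from scipy.optimize import linprog

# Cutting-plane data: phi(z) = |z| sampled at x = -1 and x = 2 (hypothetical).
xs    = [np.array([-1.0]), np.array([2.0])]
fvals = [1.0, 2.0]
gs    = [np.array([-1.0]), np.array([1.0])]
n = 1

# Variable w = (z, L); minimize c^T w = L subject to A w <= b, as in (14.8).
c = np.concatenate([np.zeros(n), [1.0]])
A = np.array([np.concatenate([g, [-1.0]]) for g in gs])
b = np.array([g @ x - f for x, f, g in zip(xs, fvals, gs)])

# Bounds zmin <= z <= zmax keep the LP bounded; L is free.
bounds = [(-5.0, 5.0)] * n + [(None, None)]
res = linprog(c, A_ub=A, b_ub=b, bounds=bounds)
L_k, z_tilde = res.x[-1], res.x[:n]
# Here L_k = 0, attained at z = 0, consistent with lb_2 evaluated above.
```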
The idea behind (14.6) is elementary, but it is very important. It shows that by knowing only a finite number of function values and subgradients of a convex function, we can deduce a lower bound on the minimum of the function. No such




property holds for general nonconvex functions. This property will be useful in designing stopping criteria for optimization algorithms that can guarantee a maximum error.
The function $\mathrm{lb}_k$ can be unbounded below, so that $L_k = -\infty$ (e.g., when $k = 1$ and $g_1 \ne 0$), which is not a useful bound. This can be avoided by explicitly specifying bounds on the variables, so that we consider
\[
\phi^* = \min_{z_{\min} \le z \le z_{\max}} \phi(z).
\]
In this case we have the lower bound
\[
L_k = \min_{z_{\min} \le z \le z_{\max}} \mathrm{lb}_k(z),
\]
which can be computed by adding the bound inequalities $z_{\min} \le z \le z_{\max}$ to the linear program (14.7).
Finally we mention that having computed $\phi$ at the points $x_1, \dots, x_k$ we have the simple upper bound on $\phi^*$:
\[
U_k = \min_{1 \le i \le k} \phi(x_i), \tag{14.9}
\]
which is the lowest objective value so far encountered. If an objective value and a subgradient are evaluated at another point $x_{k+1}$, the new lower and upper bounds $L_{k+1}$ and $U_{k+1}$ are improved:
\[
L_k \le L_{k+1}, \qquad U_{k+1} \le U_k.
\]
14.3.2 Kelley's Cutting-Plane Algorithm
Kelley's cutting-plane algorithm is a natural extension of the lower bound computation of the previous section. It is simply:

    x_1 := any initial point; [z_min, z_max] := any initial box that contains minimizers;
    k := 0;
    repeat {
        k := k + 1;
        compute phi(x_k) and any g_k in the subdifferential of phi at x_k;
        solve (14.7) to find x~_k and L_k;
        compute U_k using (14.9);
        x_{k+1} := x~_k;
    } until (U_k - L_k <= epsilon);
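The loop above can be sketched in Python, reusing a generic LP solver for step (14.7). This is a minimal illustration under the same assumptions as before, not a production implementation; the objective below is a hypothetical example:

```python
import numpy as np
from scipy.optimize import linprog

def kelley(phi, x0, zmin, zmax, eps=1e-6, max_iter=100):
    """Kelley's cutting-plane method. phi(z) returns (value, subgradient)."""
    n = len(x0)
    xs, fvals, gs = [], [], []
    x = np.asarray(x0, dtype=float)
    bounds = list(zip(zmin, zmax)) + [(None, None)]    # z bounded, L free
    c = np.concatenate([np.zeros(n), [1.0]])           # minimize L
    for _ in range(max_iter):
        f, g = phi(x)                                  # value and a subgradient
        xs.append(x); fvals.append(f); gs.append(np.asarray(g, dtype=float))
        U = min(fvals)                                 # upper bound (14.9)
        A = np.array([np.concatenate([gi, [-1.0]]) for gi in gs])
        b = np.array([gi @ xi - fi for xi, fi, gi in zip(xs, fvals, gs)])
        res = linprog(c, A_ub=A, b_ub=b, bounds=bounds)    # LP (14.7)
        L, x = res.x[-1], res.x[:n]                    # lower bound, next iterate
        if U - L <= eps:
            break
    return U, L

# Example: phi(z) = |z - 0.3| with subgradient sign(z - 0.3); here phi* = 0.
phi = lambda z: (abs(z[0] - 0.3), np.array([np.sign(z[0] - 0.3)]))
U, L = kelley(phi, [2.0], [-5.0], [5.0])
```

On exit `U - L <= eps`, so the minimum is bracketed to the requested accuracy, mirroring the guarantee discussed in the text.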
An example of the second iteration of the cutting-plane algorithm is shown in figure 14.2. The idea behind Kelley's algorithm is that at each iteration the lower bound function $\mathrm{lb}_k$ is refined or improved (i.e., made larger, so that $\mathrm{lb}_{k+1} \ge \mathrm{lb}_k$),






[Figure 14.2: In the second iteration of the cutting-plane algorithm a subgradient $g_2$ at $x_2$ is found, giving the global lower bound function $\mathrm{lb}_2$. (14.7) is solved to give $L_2$ and $x_3 = \tilde{x}_2$. (The next iteration of the cutting-plane algorithm is shown in figure 14.1.)]
since one more term is added to the maximum in (14.5). Moreover, since $\mathrm{lb}_k$ is tight at $x_1, \dots, x_k$, it is a good approximation to $\phi$ near $x_1, \dots, x_k$. In the next section, we will use this idea to show that the cutting-plane algorithm always terminates.
The cutting-plane algorithm maintains upper and lower bounds on the quantity being computed:
\[
L_k \le \phi^* \le U_k,
\]
which moreover converge as the algorithm proceeds:
\[
U_k - L_k \to 0 \quad \text{as } k \to \infty.
\]
Thus we compute $\phi^*$ to a guaranteed accuracy of $\epsilon$: on exit, we have a point with a low function value, and in addition we have a proof that there are no points with function value more than $\epsilon$ better than that of our point. Stopping criteria for general optimization algorithms (e.g., descent methods) are often more heuristic; they cannot guarantee that on exit, $\phi^*$ has been computed to a given accuracy.
Provided $\phi^* \ne 0$, it is also possible to specify a maximum relative error (as opposed to absolute error), with the modified stopping criterion
\[
\text{until } \bigl( U_k - L_k \le \epsilon \min\{ |L_k|, |U_k| \} \bigr),
\]
which guarantees a relative accuracy of at least $\epsilon$ on exit. These stopping criteria can offer a great advantage when the accuracy required is relatively low, for example,





10%. This relative accuracy can be achieved long before the objective values or iterates $x_k$ appear to be converging; still, we can confidently halt the algorithm.
A valid criticism of the cutting-plane algorithm is that the number of constraints in the linear program (14.7) that must be solved at each iteration grows with the total number of elapsed iterations. In practice, if these linear programs are initialized at the previous point, they can be solved very rapidly, e.g., in a few simplex iterations. Some cutting-plane algorithms developed since Kelley's drop constraints, so the size of the linear programs to be solved does not grow as the algorithm proceeds; see the Notes and References at the end of this chapter.
While the cutting-plane algorithm makes use of all of the information ($\phi(x_i)$, $g_i$, $i = 1, \dots, k$) that we have obtained about $\phi$ in previous iterations, the ellipsoid methods that we will describe later in this chapter maintain a data structure of constant size (at the cost of an approximation) that describes what we have learned about the function in past iterations.
14.3.3 Proof of Convergence
We noted above that when the cutting-plane algorithm terminates, we know that the optimum objective $\phi^*$ lies between the bounds $L$ and $U$, which differ by less than $\epsilon$. In this section we show that the cutting-plane algorithm does in fact always terminate.
Let $B_{\mathrm{init}}$ denote the initial box, and
\[
G = \sup \left\{ \|g\| \;\middle|\; g \in \partial\phi(z), \; z \in B_{\mathrm{init}} \right\}.
\]
($G$ can be shown to be finite.) Suppose that for $k = 1, \dots, K$ the algorithm has not terminated, i.e., $U_k - L_k > \epsilon$ for $k = 1, \dots, K$. Since
\[
L_k = \mathrm{lb}_k(x_{k+1}) = \max_{j \le k} \left( \phi(x_j) + g_j^T (x_{k+1} - x_j) \right),
\]
we have
\[
L_k \ge \phi(x_j) + g_j^T (x_{k+1} - x_j), \qquad 1 \le j \le k \le K,
\]
and hence, using $\phi(x_j) \ge U_k$ and the Cauchy-Schwarz inequality,
\[
L_k \ge U_k - G \| x_{k+1} - x_j \|, \qquad 1 \le j \le k \le K.
\]
From this and $U_k - L_k > \epsilon$ for $k \le K$ we conclude that
\[
\| x_i - x_j \| > \frac{\epsilon}{G}, \qquad i, j \le K, \; i \ne j; \tag{14.10}
\]
in other words, the minimum distance between any two of the points $x_1, \dots, x_K$ exceeds $\epsilon/G$.




A volume argument can now be used to show that $K$ cannot be too large. Around each $x_i$ we place a ball $B_i$ of diameter $\epsilon/G$. By (14.10), these balls do not intersect, so their total volume is $K$ times the volume of one ball. These balls are all contained in a box $B$ which is the original box $B_{\mathrm{init}}$ enlarged in every dimension by $\epsilon/2G$. Hence the total volume of the balls must be less than the volume of $B$. We conclude that $K$ is no larger than the volume of $B$ divided by the volume of one of the balls.
If we take $K_{\max} = \mathrm{vol}(B)/\mathrm{vol}(B_1)$, then within $K_{\max}$ iterations, the cutting-plane algorithm terminates. (We comment that this upper bound is a very poor bound, vastly greater than the typical number of iterations required.)
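Concretely, if the initial box $B_{\mathrm{init}} \subseteq \mathbf{R}^n$ has side lengths $w_1, \dots, w_n$, then, using the standard volume formula for an $n$-dimensional ball of radius $\epsilon/2G$, the bound can be written out as:

```latex
K_{\max} \;=\; \frac{\mathrm{vol}(B)}{\mathrm{vol}(B_1)}
\;=\; \frac{\prod_{i=1}^{n} \left( w_i + \epsilon/G \right)}
        {\dfrac{\pi^{n/2}}{\Gamma(n/2+1)} \left( \dfrac{\epsilon}{2G} \right)^{n}},
```

which grows rapidly as $\epsilon$ decreases or $n$ grows, consistent with the remark that this bound is very conservative.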
14.3.4 Cutting-Plane Algorithm with Constraints
The cutting-plane algorithm of the previous section can be modified in many ways to handle the constrained optimization problem (14.2). We will show one simple method, which uses the same basic idea of forming a piecewise linear lower bound approximation of a convex function based on the function values and subgradients already evaluated.
Suppose we have computed function values and at least one subgradient at $x_1, \dots, x_k$ for both the objective and the constraint function:
\[
\phi(x_1), \dots, \phi(x_k), \qquad g_1 \in \partial\phi(x_1), \dots, g_k \in \partial\phi(x_k),
\]
\[
\psi(x_1), \dots, \psi(x_k), \qquad h_1 \in \partial\psi(x_1), \dots, h_k \in \partial\psi(x_k).
\]
These points $x_i$ need not be feasible.
We form piecewise linear lower bound functions for both the objective and the constraint: $\mathrm{lb}_k$ in (14.5) and
\[
\widetilde{\mathrm{lb}}_k(z) = \max_{1 \le i \le k} \left( \psi(x_i) + h_i^T (z - x_i) \right), \tag{14.11}
\]
which satisfies $\widetilde{\mathrm{lb}}_k(z) \le \psi(z)$ for all $z \in \mathbf{R}^n$.
The lower bound function $\widetilde{\mathrm{lb}}_k$ yields a polyhedral outer approximation to the feasible set:
\[
\{ z \mid \psi(z) \le 0 \} \subseteq \{ z \mid \widetilde{\mathrm{lb}}_k(z) \le 0 \}. \tag{14.12}
\]
Thus we have the following lower bound on $\phi^*$:
\[
L_k = \min \left\{ \mathrm{lb}_k(z) \;\middle|\; \widetilde{\mathrm{lb}}_k(z) \le 0 \right\}. \tag{14.13}
\]
As in section 14.3.1, the optimization problem (14.13) is equivalent to a linear program:
\[
L_k = \min \left\{ c^T w \;\middle|\; A w \le b \right\}, \tag{14.14}
\]




where
\[
w = \begin{bmatrix} z \\ L \end{bmatrix}, \qquad
c = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad
A = \begin{bmatrix}
g_1^T & -1 \\
\vdots & \vdots \\
g_k^T & -1 \\
h_1^T & 0 \\
\vdots & \vdots \\
h_k^T & 0
\end{bmatrix}, \qquad
b = \begin{bmatrix}
g_1^T x_1 - \phi(x_1) \\
\vdots \\
g_k^T x_k - \phi(x_k) \\
h_1^T x_1 - \psi(x_1) \\
\vdots \\
h_k^T x_k - \psi(x_k)
\end{bmatrix}.
\]
We note again that the lower bound $L_k$ in (14.13) can be computed no matter how the points and subgradients were chosen.
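As with (14.7), the LP (14.14) is easy to assemble from the stored cuts. A minimal sketch with hypothetical one-dimensional data, objective $\phi(z) = z$ and constraint $\psi(z) = z^2 - 1$, sampled at two points (neither of which need be feasible):

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: phi(z) = z, psi(z) = z^2 - 1, sampled at x = -2 and x = 0.
xs    = [np.array([-2.0]), np.array([0.0])]
f_obj = [-2.0, 0.0]                            # phi(x_i)
g_obj = [np.array([1.0]), np.array([1.0])]     # subgradients of phi
f_con = [3.0, -1.0]                            # psi(x_i)
h_con = [np.array([-4.0]), np.array([0.0])]    # subgradients of psi
n = 1

# Variable w = (z, L); minimize L subject to both families of cuts in (14.14):
#   objective cuts:  g_i^T z - L <= g_i^T x_i - phi(x_i)
#   constraint cuts: h_i^T z     <= h_i^T x_i - psi(x_i)
c = np.concatenate([np.zeros(n), [1.0]])
A = np.array([np.concatenate([g, [-1.0]]) for g in g_obj]
             + [np.concatenate([h, [0.0]]) for h in h_con])
b = np.array([g @ x - f for x, f, g in zip(xs, f_obj, g_obj)]
             + [h @ x - f for x, f, h in zip(xs, f_con, h_con)])
res = linprog(c, A_ub=A, b_ub=b,
              bounds=[(-5.0, 5.0)] * n + [(None, None)])
L_k = res.x[-1]
# L_k = -1.25, a valid lower bound on phi* = -1 (minimum of z over z^2 <= 1)
```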
The modified cutting-plane algorithm is exactly the same as the cutting-plane algorithm, except that the linear program (14.14) is solved instead of (14.7), and the stopping criterion must be modified for reasons that we now consider.
If the feasible set is empty then eventually the linear program (14.14) will become infeasible. When this occurs, the cutting-plane algorithm terminates with the conclusion that the feasible set is empty. On the other hand, if the feasible set is not empty, and the optimum occurs on the boundary of the feasible set, which is often the case, then the iterates $x_k$ are generally infeasible, since they lie in the outer approximation