Lagrange Multipliers without Permanent Scarring
Dan Klein
1 Introduction
This tutorial assumes that you want to know what Lagrange multipliers are, but are more interested in getting the
intuitions and central ideas. It contains nothing which would qualify as a formal proof, but the key ideas needed
to read or reconstruct the relevant formal results are provided. If you don’t understand Lagrange multipliers,
that’s fine. If you don’t understand vector calculus at all, in particular gradients of functions and surface normal
vectors, the majority of the tutorial is likely to be somewhat unpleasant. Understanding of vector spaces,
spanned subspaces, and linear combinations is a bonus (a few sections will be somewhat mysterious if these
concepts are unclear).
Lagrange multipliers are a mathematical tool for constrained optimization of differentiable functions. In the
basic, unconstrained version, we have some (differentiable) function f(x) over x ∈ ℝⁿ that we want to
maximize (or minimize). We can do this by first finding extreme points of f, which are points where the gradient
∇f is zero, or, equivalently, where each of the partial derivatives is zero. If we’re lucky, points like this that we find
will turn out to be (local) maxima, but they can also be minima or saddle points. We can tell the different cases
apart by a variety of means, including checking properties of the second derivatives or simply inspecting the
function values. Hopefully this is all familiar from calculus, though maybe it’s more concretely clear when
dealing with functions of just one variable.
All kinds of practical problems can crop up in unconstrained optimization, which we won’t worry about
here. One is that f and its derivatives can be expensive to compute, causing people to worry about how many
evaluations are needed to find a maximum. A second problem is that there can be (infinitely) many local
maxima which are not global maxima, causing people to despair. We’re going to ignore these issues, which are
as big or bigger problems for the constrained case.
In constrained optimization, we have the same function f to maximize as before. However, we also have
some restrictions on which points in ℝⁿ we are interested in. The points which satisfy our constraints are
referred to as the feasible region. A simple constraint on the feasible region is to add boundaries, such as
insisting that each x_i be positive. Boundaries complicate matters because extreme points on the boundaries
will not, in general, meet the zero-derivative criterion, and so must be searched for in other ways. You probably
had to deal with boundaries in calculus class. Boundaries correspond to inequality constraints, which we will
say relatively little about in this tutorial.
Lagrange multipliers can help deal with both equality constraints and inequality constraints. For the majority
of the tutorial, we will be concerned only with equality constraints, which restrict the feasible region to points
lying on some surface inside ℝⁿ. Each constraint will be given by a function g(x), and we will only
be interested in points where g(x) = 0.¹

¹ If you want a g(x) = c constraint, you can just move the c to the left: g(x) − c = 0.
Figure 1: A one-dimensional domain... with a constraint. Maximize the value of 2 − x² while satisfying x − 1 = 0.
Figure 2: The paraboloid 2 − x² − 2y².
2 Trial by Example

Let’s do some example maximizations. First, we’ll have an example of not using Lagrange multipliers.

2.1 A No-Brainer

Let’s say you want to know the maximum value of f(x) = 2 − x² subject to the constraint x − 1 = 0 (see
figure 1). Here we can just substitute our value for x (1) into f, and get our maximum value of 2 − 1² = 1. It
isn’t the most challenging example, but we’ll come back to it once the Lagrange multipliers show up. However,
it highlights a basic way that we might go about dealing with constraints: substitution.
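Substitution here is one line of arithmetic, and it can be sketched in code just as directly (a tiny illustration; the name `f` is mine):

```python
# No-brainer: maximize f(x) = 2 - x^2 subject to x - 1 = 0.
def f(x):
    return 2 - x**2

# The constraint forces x = 1, so maximizing reduces to evaluating f there.
print(f(1))  # 1
```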
2.2 Substitution

Let f(x, y) = 2 − x² − 2y². This is the downward cupping paraboloid shown in figure 2. The unconstrained
maximum is clearly at (x, y) = (0, 0), while the unconstrained minimum is not even defined (you can find points
with f as low as you like). Now let’s say we constrain x and y to lie on the unit circle. To do this, we add
the constraint g(x, y) = x² + y² − 1 = 0. Then, we maximize (or minimize) by first solving for one of the
variables explicitly:

x² + y² − 1 = 0  (1)
x² = 1 − y²  (2)
Figure 3: The paraboloid 2 − x² − 2y² along with two different constraints. Left is the unit circle x² + y² = 1,
right is the line x + y = 1.
and substitute into f:

f(x, y) = 2 − x² − 2y²  (4)
 = 2 − (1 − y²) − 2y²  (5)
 = 1 − y²  (6)
Then, we’re back to a one-dimensional unconstrained problem, which has a maximum at y = 0, where
x = ±1 and f = 1. This shouldn’t be too surprising; we’re stuck on a circle which trades x² for y² linearly, while
y² costs twice as much in f.

Finding the constrained minimum here is slightly more complex, and highlights one weakness of this approach;
the one-dimensional problem is still actually somewhat constrained in that y must be in [−1, 1]. The
minimum value occurs at both these boundary points, where x = 0 and f = 0.
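If you distrust the calculus, a quick numeric check is easy: parameterize the unit circle as (cos t, sin t) and scan f along it. This sketch just confirms the substitution result above:

```python
import math

# On the unit circle: f(cos t, sin t) = 2 - cos^2 t - 2 sin^2 t = 1 - sin^2 t.
def f(x, y):
    return 2 - x**2 - 2 * y**2

ts = [2 * math.pi * k / 10000 for k in range(10000)]
values = [f(math.cos(t), math.sin(t)) for t in ts]

print(max(values))  # ~1, at (x, y) = (±1, 0)
print(min(values))  # ~0, at (x, y) = (0, ±1)
```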
2.3 Inflating Balloons

The main problem with substitution is that, despite our stunning success in the last section, it’s usually very
hard to do. Rather than inventing a new problem and discovering this the hard way, let’s stick with the f from
the last section and consider how the Lagrange multiplier method would work. Figure 3(left) shows a contour
plot of f. The contours, or level curves, are ellipses, which are wide in the x dimension, and which represent
points which have the same value of f. The dark circle in the middle is the feasible region satisfying the
constraint g(x, y) = 0. The arrows point in the directions of greatest increase of f. Note that the direction of greatest
increase is always perpendicular to the level curves.

Imagine the ellipses as snapshots of an inflating balloon. As the balloon expands, the value of f along
the ellipse decreases. The size-zero ellipse has the highest value of f. Consider what happens as the ellipse
expands. At first, the values of f are high, but the ellipse does not intersect the feasible circle anywhere. When
the long axis of the ellipse finally touches the circle at (±1, 0), f = 1 as in figure 4(left). This is the maximum
constrained value for f – any larger, and no point on the level curve will be in the feasible circle. The key thing
is that, at f = 1, the ellipse is tangent to the circle.²
The ellipse then continues to grow, f dropping, intersecting the circle at four points, until the ellipse surrounds
the circle and only the short-axis endpoints are still touching. This is the minimum (f = 0, x = 0,
y = ±1). Again, the two curves are tangent. Beyond this value, the level curves do not intersect the circle.
The curves being tangent at the minimum and maximum should make intuitive sense. If the two curves were
not tangent, imagine a point (call it p) where they touch. Since the curves aren’t tangent, the curves will
cross, meeting at p, as in figure 4(right). Since the f contour (light curve) is a level curve, the points
to one side of the contour have greater f value, while the points on the other side have lower f value. Since
we may move anywhere along g and still satisfy the constraint, we can nudge p along g to either side of the
contour to either increase or decrease f. So p cannot be an extreme point.

² Differentiable curves which touch but do not cross are tangent, but feel free to verify it by checking derivatives!
Figure 4: Level curves of the paraboloid, intersecting the constraint circle.
This intuition is very important; the entire enterprise of Lagrange multipliers (which are coming soon, really!)
rests on it. So here’s another, equivalent, way of looking at the tangent requirement, which generalizes
better. Consider again the zooms in figure 4. Now think about the normal vectors of the contour and constraint
curves. The two curves being tangent at a point is equivalent to the normal vectors being parallel at that point.
The contour is a level curve, and so the gradient of f, ∇f, is normal to it. But that means that at an extreme
point p, the gradient of f will be perpendicular to g as well. This should also make sense – the gradient is the
direction of steepest ascent. At a solution p, we must be on g, and, while it is fine for f to have a non-zero
gradient, the direction of steepest ascent had better be perpendicular to g. Otherwise, we can project ∇f onto
g, get a non-zero direction along g, and nudge p along that direction, increasing f but staying on g. If the
directions of steepest increase and decrease take you perpendicularly off of g, then, even if you are not at an
unconstrained maximum of f, there is no local move you can make to increase f which does not take you out
of the feasible region g.

Formally, we can write our claim that the normal vectors are parallel at an extreme point p as:

∇f(p) = λ∇g(p)  (7)
So, our method for finding extreme points³ which satisfy the constraint is to look for points where the following
equations hold true:

∇f(x) = λ∇g(x)  (8)
g(x) = 0  (9)

We can compactly represent both equations at once by writing the Lagrangian:

L(x, λ) = f(x) − λg(x)  (10)

and asking for points where

∇L(x, λ) = 0  (11)

The partial derivatives with respect to x recover the parallel-normals equations, while the partial derivative
with respect to λ recovers the constraint g(x) = 0. The λ is our first Lagrange multiplier.
Let’s re-solve the circle-paraboloid problem from above using this method. It was so easy to solve with
substitution that the Lagrange multiplier method isn’t any easier (in fact it’s harder), but at least it illustrates the
method. The Lagrangian is:

L(x, y, λ) = f(x, y) − λg(x, y)  (12)
 = 2 − x² − 2y² − λ(x² + y² − 1)  (13)

and we want

∇L(x, y, λ) = 0  (14)

³ We can sort out after we find them which are minima, maxima, or neither.
which gives the equations

∂L/∂x = −2x − 2λx = 0  (16)
∂L/∂y = −4y − 2λy = 0  (17)
∂L/∂λ = −(x² + y² − 1) = 0  (18)

From the first two equations, we must have either λ = −1 or λ = −2. If λ = −1, then y = 0,
x = ±1, and f = 1. If λ = −2, then x = 0, y = ±1, and f = 0. These are the maximum and minimum, respectively.
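The case analysis can be verified mechanically. A small sketch (plain Python, names mine) that plugs each candidate into the three equations (16)–(18) and then compares f values:

```python
# Stationary points of L = 2 - x^2 - 2y^2 - lam*(x^2 + y^2 - 1).
def f(x, y):
    return 2 - x**2 - 2 * y**2

# Case lam = -1 forces y = 0; the constraint then gives x = +1 or -1.
# Case lam = -2 forces x = 0; the constraint then gives y = +1 or -1.
candidates = [((1.0, 0.0), -1.0), ((-1.0, 0.0), -1.0),
              ((0.0, 1.0), -2.0), ((0.0, -1.0), -2.0)]

for (x, y), lam in candidates:
    assert abs(-2*x - 2*lam*x) < 1e-12   # dL/dx = 0
    assert abs(-4*y - 2*lam*y) < 1e-12   # dL/dy = 0
    assert abs(x**2 + y**2 - 1) < 1e-12  # constraint holds

print([f(x, y) for (x, y), _ in candidates])  # [1.0, 1.0, 0.0, 0.0]
```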
Let’s say we instead want the constraint that x and y sum to 1 (x + y − 1 = 0). Then, we have the situation
in figure 3(right). Before we do anything numeric, convince yourself from the picture that the maximum is
going to occur in the (+,+) quadrant, at a point where the line is tangent to a level curve of f. Also convince
yourself that the minimum will not be defined; f values get arbitrarily low in both directions along the line
away from the maximum. Formally, we have

L(x, y, λ) = f(x, y) − λg(x, y)  (20)
 = 2 − x² − 2y² − λ(x + y − 1)  (21)

and we want

∇L(x, y, λ) = 0  (22)

which gives

∂L/∂x = −2x − λ = 0  (24)
∂L/∂y = −4y − λ = 0  (25)
∂L/∂λ = −(x + y − 1) = 0  (26)

We can see from the first two equations that x = 2y, which, since they sum to one, means x = 2/3,
y = 1/3. At those values, f = 4/3 and λ = −4/3.
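Again the solution is easy to verify numerically (a sketch under the same definitions as above):

```python
# Verify the solution of L = 2 - x^2 - 2y^2 - lam*(x + y - 1).
x, y = 2/3, 1/3
lam = -2 * x                      # from dL/dx = -2x - lam = 0

assert abs(-2*x - lam) < 1e-12    # dL/dx = 0
assert abs(-4*y - lam) < 1e-12    # dL/dy = 0
assert abs(x + y - 1) < 1e-12     # constraint holds

print(2 - x**2 - 2*y**2, lam)     # f = 4/3 and lam = -4/3
```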
So what do we have so far? Given a function and a constraint, we can write the Lagrangian, differentiate,
and solve for zero. Actually solving that system of equations can be hard, but note that the Lagrangian is a
function of n + 1 variables (the n components of x plus λ) and so we do have the right number of equations to hope for unique,
existing solutions: n from the x partial derivatives, plus one from the λ partial derivative.
2.4 More Dimensions

If we want to have multiple constraints, this method still works perfectly well, though it gets harder to draw
the pictures to illustrate it. To generalize, let’s think of the parallel-normal idea in a slightly different way.
In unconstrained optimization (no constraints), we knew we were at a local extreme because the gradient of f
was zero – there was no local direction of motion which increased f. Along came the constraint g and
dashed all hopes of the gradient being completely zero at a constrained extreme p, because we were confined
to g. However, we still wanted there to be no direction of increase inside the feasible region. This occurred
whenever the gradient at p, while probably not zero, had no components which were perpendicular to the
normal of g at p. To recap: in the presence of a constraint, ∇f(p) does not have to be zero at a solution p, it
just has to be entirely contained in the (one-dimensional) subspace spanned by ∇g(p).
The last statement generalizes to multiple constraints. With multiple constraints g_i, we will insist
that a solution p satisfy each g_i(p) = 0. We will also want the gradient ∇f(p) to be zero along the
Figure 5: A spherical level curve of the distance from the origin, with two constraint planes, y − 1 = 0 and z − 1 = 0.
directions that p is free to vary. However, given the constraints, p cannot make any local movement along
vectors which have any component perpendicular to any constraint. Therefore, our condition should again be
that ∇f(p), while not necessarily zero, is entirely contained in the subspace spanned by the ∇g_i(p) normals.
We can express this by the equation

∇f(x) = Σ_i λ_i ∇g_i(x)  (28)

which asserts that ∇f(p) be a linear combination of the normals, with weights λ_i.
It turns out that tossing all the constraints into a single Lagrangian accomplishes this:

L(x, λ) = f(x) − Σ_i λ_i g_i(x)  (29)

It should be clear that differentiating L(x, λ) with respect to λ_i and setting the result equal to zero recovers the i-th
constraint, g_i(x) = 0, while differentiating with respect to the x_j recovers the assertion that the gradient of f
have no components which aren’t spanned by the constraint normals.
As an example of multiple constraints, consider figure 5. Imagine that f is the distance from the origin.
Thus, the level surfaces of f are concentric spheres with the gradient pointing straight out of the spheres. Let’s
say we want the minimum of f subject to the constraints y − 1 = 0 and z − 1 = 0, shown as planes in the
figure. Again imagine the spheres as expanding from the center, until the sphere makes contact with the planes. The
unconstrained minimum is, of course, at the origin, where f is zero. The sphere grows, and f increases.
When the sphere’s radius reaches one, the sphere touches both planes individually. At the points of contact, the
gradient of f is perpendicular to the touching plane. Those points would be solutions if that plane were the only
constraint. When the sphere reaches a radius of √2, it is touching both planes along their line of intersection.
Note that the gradient is not zero at that point, nor is it perpendicular to either surface. However, it is parallel
to an (equal) combination of the two planes’ normal vectors, or, equivalently, it lies inside the plane spanned
by those vectors (the plane x = 0, [not shown due to my lacking matlab skills]).
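To keep the derivatives simple, a sketch can use the squared distance x² + y² + z² in place of the distance itself; it has the same spherical level surfaces, so the geometry is unchanged:

```python
# Constrained minimum of the squared distance with constraints y - 1 = 0 and
# z - 1 = 0. The solution is p = (0, 1, 1); check that the gradient there is
# an equal combination of the constraint normals (0, 1, 0) and (0, 0, 1).
p = (0.0, 1.0, 1.0)
grad_f = (2*p[0], 2*p[1], 2*p[2])          # gradient of x^2 + y^2 + z^2
n1, n2 = (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)  # normals of the two planes

lam1, lam2 = 2.0, 2.0                      # equal weights, as in the text
combo = tuple(lam1*a + lam2*b for a, b in zip(n1, n2))
print(grad_f == combo)  # True: grad f lies in the plane x = 0 they span
```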
A good way to think about the effect of adding constraints is as follows. Before there are any constraints,
there are n dimensions for x to vary along when maximizing, and we want to find points where all n dimensions
have zero gradient. Every time we add a constraint, we restrict one dimension, so we have less freedom in
maximizing. However, that constraint also removes a dimension along which the gradient must be zero. So,
in the “nice” case, we should be able to add as many or as few constraints (up to n) as we wish, and everything
should work out.⁴

⁴ In the “not-nice” cases, all sorts of things can go wrong. Constraints may be jointly unsatisfiable (e.g. x = 0 and x = 1), or subtler situations
can prevent the Lagrange multipliers from existing [more].
3 The Lagrangian

The Lagrangian L(x, λ₁, …, λ_k) is a function of n + k variables: n from x ∈ ℝⁿ, plus
one for each of the k multipliers λ_i. Differentiating gives the corresponding n + k equations, each set to zero,
to solve. The n equations from differentiating with respect to each x_j recover our gradient conditions. The
k equations from differentiating with respect to the λ_i recover the constraints g_i. So the numbers give us some
confidence that we have the right number of equations to hope for point solutions.
It’s helpful to have an idea of what the Lagrangian actually means. There are two intuitions, described below.
3.1 The Lagrangian as an Encoding

First, we can look at the Lagrangian as an encoding of the problem. This view is easy to understand (but doesn’t
really get us anywhere). Whenever the constraints are satisfied, the g_i are zero, and so at these points, regardless
of the value of the λ multipliers, L(x, λ) = f(x). This is a good fact to keep in mind.
You could imagine using the Lagrangian to do constrained maximization in the following way. You move
around ℝⁿ looking for a maximum value of f. However, you have no control over λ, which gets set in the
worst way possible for you. Therefore, when you choose x, λ is chosen to minimize L. Formally, the
problem is to find the x which gives

max_x min_λ L(x, λ)  (30)

Now remember that if your x happens to satisfy the constraints, L(x, λ) = f(x), regardless of what λ is.
However, if x does not satisfy the constraints, some g_i(x) ≠ 0. But then, λ_i can be fiddled to make
L(x, λ) as small as desired, and min_λ L(x, λ) will be −∞. So (30) will be the maximum value of f subject to the
constraints.
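A numeric illustration of this max-min game on the no-brainer problem (names mine):

```python
# L(x, lam) = 2 - x^2 - lam*(x - 1) for the no-brainer. For feasible x the
# lam term vanishes; for infeasible x, lam can drive L down without bound.
def L(x, lam):
    return 2 - x**2 - lam * (x - 1)

# At x = 1 (constraint satisfied), lam has no effect: L = f(1) = 1 always.
assert len({L(1, lam) for lam in (-100, -1, 0, 1, 100)}) == 1

# At x = 2 (constraint violated), cranking lam makes L arbitrarily small.
print([L(2, lam) for lam in (0, 10, 1000)])  # [-2, -12, -1002]
```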
3.2 Reversing the Scope

The problem with the above view of the Lagrangian is that it really doesn’t accomplish anything beyond encoding
the constraints and handing us back the same problem we started with: find the maximum value of f,
ignoring the values of x which are not in the feasible region. More usefully, we can switch the max and min
from the previous section, and the result still holds:

min_λ max_x L(x, λ)  (31)

This is part of the full Kuhn-Tucker theorem (cite), which we aren’t going to prove rigorously. However, the
intuition behind why it’s true is important. Before we examine why this reversal should work, let’s see what it
accomplishes if it’s true.

We originally had a constrained optimization problem. We would very much like for it to become an unconstrained
optimization problem. Once we fix the values of the λ multipliers, L(x, λ) becomes a function of x
alone. We might be able to maximize that function (it’s unconstrained!) relatively easily. If so, we would get a
solution for each λ; call it x*(λ). But then we can do an unconstrained minimization of L(x*(λ), λ) over the space
of λ. We would then have our solution.

It might not be clear why that’s any different than fixing x and finding a minimizing value λ*(x) for each x.
It’s different in two ways. First, unlike max_x L(x, λ) as a function of λ, min_λ L(x, λ) as a function of x is not continuous. (Remember that it’s negative
infinity almost everywhere and jumps to f(x) for x which satisfy the constraints.) Second, it is often the case
that we can find a closed-form solution for x*(λ), while we have nothing useful to say about λ*(x). This is also
a general instance of switching to a dual problem when a primal problem is unpleasant in some way. [cites]
3.3 Duality

Let’s say we’re convinced that it would be a good thing if

max_x min_λ L(x, λ) = min_λ max_x L(x, λ)  (32)
Figure 6: The Lagrangian of the paraboloid 2 − x² with the constraint x − 1 = 0.
Now, we’ll argue why this is true: examples first, intuition second, formal proof elsewhere. Recall the no-brainer
problem: maximize 2 − x² subject to x − 1 = 0. Let’s form the Lagrangian:

L(x, λ) = 2 − x² − λ(x − 1)  (33)

This surface is plotted in figure 6. The straight dark line is the value of L at x = 1. At that value, the
constraint is satisfied and so, as promised, λ has no effect on L and L(1, λ) = f(1) = 1. At each x, min_λ L(x, λ)
is −∞, except for x = 1, where it is 1. The curving dark line is the value of L at x = x*(λ) for each λ. The minimum
L value along this x*(λ) line is at (x, λ) = (1, −2), where f(1) = 1, which is the maximum (and
only) value of f among the point(s) satisfying the constraint.
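The dual view pays off concretely here: x*(λ) has a closed form, and minimizing over λ recovers the constrained maximum. A sketch:

```python
# For fixed lam, maximize L(x, lam) = 2 - x^2 - lam*(x - 1) over x:
# dL/dx = -2x - lam = 0 gives x*(lam) = -lam/2.
def x_star(lam):
    return -lam / 2

def dual(lam):
    x = x_star(lam)
    return 2 - x**2 - lam * (x - 1)   # equals 2 + lam + lam^2/4

# Minimize the dual on a grid; the minimum is at lam = -2, where x*(-2) = 1.
lams = [l / 100 for l in range(-500, 500)]
lam_best = min(lams, key=dual)
print(lam_best, x_star(lam_best), dual(lam_best))  # -2.0 1.0 1.0
```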
3.4 A sort-of proof of sort-of the Kuhn-Tucker theorem

The Lagrangian is hard to plot when n > 1. However, let’s consider what happens in the environment of a point
x* which satisfies the constraints, and which is a local maximum among the points satisfying the constraints.
Since each g_i(x*) = 0, the derivatives of L(x, λ) with respect to each λ_i are zero at x*. ∇f(x*) may not be zero. But
if ∇f(x*) has any component which is not in the space spanned by the constraint normals ∇g_i(x*), then we can
nudge x* in a direction inside the allowed region, increasing f. Since x* is a local maximum inside that region,
that isn’t possible. So ∇f(x*) is in the space spanned by the constraint normals ∇g_i(x*), and can therefore be
written as a (unique) linear combination of these normals. Let ∇f(x*) = Σ_i λ*_i ∇g_i(x*) be that combination.
Then clearly ∇f(x*) − Σ_i λ*_i ∇g_i(x*) = 0.
Now consider a vector λ near, but not equal to, λ*. ∇f(x*) − Σ_i λ_i ∇g_i(x*) cannot still be zero, because the linear combination
weights λ*_i are unique. But that combination is exactly ∇_x L(x*, λ), which is therefore non-zero. Thus, fixing λ and allowing x* to
vary, there is some direction (either ∇_x L(x*, λ) or the reverse direction) along which we could nudge x* to increase L.
Therefore, at λ = λ*, max_x L(x, λ) is at a local minimum.
Another way to remember this intuitively is that λ* is probably not zero, and, if we set λ to zero (a huge
nudge), L(x, 0) = f(x), and so the maximum of L is the unconstrained maximum of f, which can only be
larger than f(x*).
Let’s look at another example. Recall the paraboloid (figure 3) with the constraint that x and y sum to one.
The maximum value occurred at (x, y) = (2/3, 1/3), where f = 4/3. The λ value was −4/3. Figure 7 shows
what happens when we nudge λ up and down slightly. At λ = 0, the Lagrangian L is just the original surface
f. Its maximum value (2) is at the origin (which obviously doesn’t satisfy the constraint). At λ = −4/3, the
maximum value of the Lagrangian is at x* = (2/3, 1/3) (which does satisfy the constraint). The gradient of f
is not zero there, but it is perpendicular to the constraint line, so x* is a local maximum along that line. Another way
of thinking of this is that the gradient of f (the top arrow field) is balanced at that point by the scaled gradient
of the constraint (the second arrow field down). We can see the effect by adding these two fields, which forms
the gradient of the Lagrangian (third arrow field). This gradient is zero at x* with the right λ. If we nudge λ up
to −4/3 + 1/3 = −1, then suddenly the gradient of f is no longer completely cancelled out by λ∇g, and so we can
[Four panels showing the Lagrangian surface and its gradient fields at lambda = 0, lambda = −1.6667, lambda = −1.3333, and lambda = −1.]
Figure 7: Lagrangian surfaces for the paraboloid 2 − x² − 2y² with the constraint x + y − 1 = 0.
increase the Lagrangian by nudging x* toward the origin. Similarly, if we nudge λ down to −4/3 − 1/3 = −5/3, then the
gradient of f is over-cancelled and we can increase the Lagrangian by nudging x* away from the origin.
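The balancing act of the arrow fields can be checked directly at x* = (2/3, 1/3) (a sketch; ∇L = ∇f − λ∇g as in the text):

```python
# The three fields of figure 7: grad f, lam*grad g, and their difference.
def grad_f(x, y):
    return (-2*x, -4*y)

grad_g = (1.0, 1.0)  # gradient of g(x, y) = x + y - 1

def grad_L(x, y, lam):
    return tuple(a - lam*b for a, b in zip(grad_f(x, y), grad_g))

x, y = 2/3, 1/3
print(grad_L(x, y, -4/3))  # ~(0, 0): perfectly balanced
print(grad_L(x, y, -1))    # ~(-1/3, -1/3): points toward the origin
print(grad_L(x, y, -5/3))  # ~(1/3, 1/3): points away from the origin
```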
3.5 What do the multipliers mean?

A useful aspect of the Lagrange multiplier method is that the values of the multipliers at solution points often
have some significance. Mathematically, a multiplier λ_i is the value of the partial derivative of L with respect to
the constraint g_i. So it is the rate at which we could increase the Lagrangian if we were to raise the target of that
constraint (from zero). But remember that at solution points x*, L(x*, λ) = f(x*). Therefore, the rate of increase
of the Lagrangian with respect to that constraint is also the rate of increase of the maximum constrained value
of f with respect to that constraint.

In economics, when f is a profit function and the g_i are constraints on resource amounts, λ_i would be the
amount (possibly negative!) by which profit would rise if one were allowed one more unit of resource i. This
rate is called the shadow price of i, which is interpreted as the amount it would be worth to relax that constraint
upwards (by R&D, mining, bribery, or whatever means).
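For the paraboloid-and-line example, this interpretation is easy to test: raise the constraint target from x + y = 1 to x + y = 1 + eps and differentiate the constrained maximum numerically (a sketch; the substitution solution scales to x = 2(1+eps)/3, y = (1+eps)/3):

```python
# Shadow-price check: the rate of change of the constrained maximum with
# respect to the constraint target should equal the multiplier, lam = -4/3.
def f_star(eps):
    x = 2 * (1 + eps) / 3
    y = (1 + eps) / 3
    return 2 - x**2 - 2 * y**2

eps = 1e-6
rate = (f_star(eps) - f_star(-eps)) / (2 * eps)  # central difference
print(rate)  # ~ -4/3, matching lam
```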
[Physics example?]
4 A bigger example than you probably wanted

This section contains a big example of using the Lagrange multiplier method in practice, as well as another
case where the multipliers have an interesting interpretation.
Figure 8: The simplex x + y + z = 1.
4.1 Maximum Entropy Models

5 Extensions

5.1 Inequality Constraints
The Lagrange multiplier method also covers the case of inequality constraints. Constraints of this form are
written h(x) ≥ 0. The key observation about how inequality constraints work is that, at any given x, a constraint h has
either h(x) = 0 or h(x) > 0, and the two cases are qualitatively very different. The two possibilities are shown in figure ??.
If h(x) = 0 then h is said to be active at x, otherwise it is inactive. If h is active at x, then h is a lot like an
equality constraint; it allows x to be a maximum if the gradient of f, ∇f(x), is either zero or pointing towards
negative values of h (which violate the constraint). However, if the gradient is pointing towards positive values
of h, then there is no reason that we cannot move in that direction. Recall that we used to write

∇f(x) = λ∇g(x)  (34)

for a (single) equality constraint. The interpretation was that, if x is a solution, ∇f(x) must be entirely in the
direction of the normal to g, ∇g(x). For inequality constraints, we write

∇f(x) = α∇h(x)  (35)

but, if x is a maximum, then if ∇f(x) is non-zero, it not only has to be parallel to ∇h(x), but it must actually
point in the opposite sense along that direction (i.e., out of the feasible side and towards the forbidden side). We
can actually enforce this very simply, by restricting the multiplier α to be negative (or zero). Positive multipliers
would mean that the direction of increasing f is the same as the direction of increasing h – but points in that situation
certainly aren’t solutions, as we want to increase f and we are allowed to increase h.
If h is inactive at x (h(x) > 0), then we want to be even stricter about what values of α are acceptable at
a solution. In fact, in this case, α must be zero at x. (Intuitively, if h is inactive, then nothing should change at
x if we drop h.) [better explanation]

In summary, for inequality constraints, we add them to the Lagrangian just as if they were equality constraints,
except that we require that α_i ≤ 0 and that, if h_i(x) is not zero, then α_i is. The situation where one or the
other can be non-zero, but not both, is referred to as complementary slackness. This situation can be compactly
written as α_i h_i(x) = 0. Bundling it all up, complete with multiple constraints, we get the general Lagrangian:
L(x, λ, α) = f(x) − Σ_i λ_i g_i(x) − Σ_i α_i h_i(x)  (36)

The Kuhn-Tucker theorem (or our intuitive arguments) tells us that if a point x is a maximum of f subject to
the constraints g_i(x) = 0 and h_i(x) ≥ 0, then:

∇L(x, λ, α) = ∇f(x) − Σ_i λ_i ∇g_i(x) − Σ_i α_i ∇h_i(x) = 0  (37)
α_i ≤ 0  for all i  (38)
Σ_i α_i h_i(x) = 0  (39)

The second condition takes care of the restriction on active inequalities. The third condition is a somewhat
cryptic way of insisting that, for each i, either α_i is zero or h_i(x) is zero.
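A sketch of checking these first-order conditions on two one-dimensional problems with f(x) = 2 − x², one with the constraint active and one with it inactive (names mine):

```python
# First-order KKT check with a single inequality constraint h(x) >= 0,
# using L(x, alpha) = f(x) - alpha*h(x) with f(x) = 2 - x^2.
def check_kkt(x, alpha, h, dh):
    df = -2 * x                          # f'(x)
    assert abs(df - alpha * dh) < 1e-12  # stationarity: f' = alpha * h'
    assert alpha <= 0                    # sign restriction (38)
    assert abs(alpha * h(x)) < 1e-12     # complementary slackness (39)

# Constraint x - 1 >= 0: the maximum is on the boundary x = 1 (h active),
# so alpha = f'(1)/h'(1) = -2.
check_kkt(x=1.0, alpha=-2.0, h=lambda x: x - 1, dh=1.0)

# Constraint x + 1 >= 0: the unconstrained maximum x = 0 is feasible
# (h inactive), so alpha must be 0.
check_kkt(x=0.0, alpha=0.0, h=lambda x: x + 1, dh=1.0)
print("KKT conditions hold for both cases")
```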
Now is probably a good time to point out that there is more to the Kuhn-Tucker theorem than the above
statement. The above conditions are called the first-order conditions. All (local) maxima will satisfy them. The
theorem also gives second-order conditions on the second derivative (Hessian) matrices which distinguish local
maxima from other situations which can trigger “false alarms” with the first-order conditions. However, in
many situations, one knows in advance that the solution will be a maximum (such as in the maximum entropy
example).

[Caveat about globals?]
6 Conclusion

This tutorial only introduces the basic concepts of the Lagrange multiplier method. If you are interested,
there are many detailed texts on the subject [cites]. The goal of this tutorial was to supply some intuition
behind the central ideas so that other, more comprehensive and formal sources become more accessible.
Feedback requested!