12.1 Modeling Conditional Probabilities
So far, we either looked at estimating the conditional expectations of continuous
variables (as in regression), or at estimating distributions. There are many situations,
however, where we are interested in input-output relationships, as in regression, but
the output variable is discrete rather than continuous. In particular there are many
situations where we have binary outcomes (it snows in Pittsburgh on a given day, or
it doesn't; this squirrel carries plague, or it doesn't; this loan will be paid back, or
it won't; this person will get heart disease in the next five years, or they won't). In
addition to the binary outcome, we have some input variables, which may or may
not be continuous. How could we model and analyze such data?
We could try to come up with a rule which guesses the binary output from the
input variables. This is called classification, and is an important topic in statistics
and machine learning. However, simply guessing "yes" or "no" is pretty crude --
especially if there is no perfect rule. (Why should there be?) Something which takes
noise into account, and doesn't just give a binary answer, will often be useful. In
short, we want probabilities -- which means we need to fit a stochastic model.
What would be nice, in fact, would be to have the conditional distribution of the
response Y, given the input variables, Pr(Y | X). This would tell us how precise
our predictions are. If our model says that there's a 51% chance of snow and it
doesn't snow, that's better than if it had said there was a 99% chance of snow (though
even a 99% chance is not a sure thing). We have seen how to estimate conditional
probabilities non-parametrically, and could do this using the kernels for discrete vari-
ables from lecture 6. While there are a lot of merits to this approach, it does involve
coming up with a model for the joint distribution of outputs Y and inputs X , which
can be quite time-consuming.
Let's pick one of the classes and call it "1" and the other "0". (It doesn't matter
which is which.) Then Y becomes an indicator variable, and you can convince
yourself that Pr (Y = 1) = E [Y ]. Similarly, Pr (Y = 1|X = x) = E [Y |X = x]. (In
a phrase, "conditional probability is the conditional expectation of the indicator".)
224 CHAPTER 12. LOGISTIC REGRESSION
This helps us because by this point we know all about estimating conditional ex-
pectations. The most straightforward thing for us to do at this point would be to
pick out our favorite smoother and estimate the regression function for the indicator
variable; this will be an estimate of the conditional probability function.
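The identity between probabilities and expectations of indicators is easy to check numerically. The following is a minimal sketch, not part of the original notes: the binary input, the two conditional probabilities (0.2 and 0.7), the sample size, and the seed are all made-up values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, only for reproducibility
n = 100_000

# Hypothetical setup: a binary input x, with Pr(Y=1|X=0) = 0.2 and Pr(Y=1|X=1) = 0.7.
x = rng.integers(0, 2, size=n)
p_x = np.where(x == 1, 0.7, 0.2)
y = rng.binomial(1, p_x)            # indicator variable: every y is 0 or 1

# The mean of a 0/1 variable is exactly the fraction of 1s,
# so it estimates E[Y] and Pr(Y = 1) at the same time.
frac_ones = np.count_nonzero(y == 1) / n
mean_y = y.mean()

# Conditional version: averaging y within each value of x estimates
# E[Y | X = x], which is Pr(Y = 1 | X = x).
cond_mean = {v: y[x == v].mean() for v in (0, 1)}

print(frac_ones, mean_y)
print(cond_mean)
```

The within-group averages should land close to the assumed 0.2 and 0.7, which is the sense in which a regression estimate for the indicator is also an estimate of the conditional probability.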
There are two reasons not to just plunge ahead with that idea. One is that proba-
bilities must be between 0 and 1, but our smoothers will not necessarily respect that,
even if all the observed yi they get are either 0 or 1. The other is that we might be
better off making more use of the fact that we are trying to estimate probabilities, by
more explicitly modeling the probability.
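To see the first problem concretely, here is a small sketch in which an ordinary least-squares line stands in for "our favorite smoother" (that substitution is an assumption made for brevity; the text does not single out any particular smoother). The data-generating probabilities, sample size, and seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)     # fixed seed, only for reproducibility
n = 2000

# Hypothetical data: the true conditional probability rises smoothly with x.
x = rng.uniform(-3, 3, size=n)
p_true = 1.0 / (1.0 + np.exp(-2.0 * x))
y = rng.binomial(1, p_true)         # observed responses are all 0 or 1

# Regress the indicator on x with a straight line (least squares).
slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the fitted "probabilities" across the range of the data.
grid = np.linspace(-3, 3, 61)
fitted = intercept + slope * grid

print(fitted.min(), fitted.max())
```

Even though every observed response is 0 or 1, the fitted values at the ends of the range fall below 0 and above 1, so they cannot be read as probabilities there.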
Assume that $\Pr(Y = 1 \mid X = x) = p(x; \theta)$, for some function $p$ parameterized by
$\theta$, and further assume that observations are independent of each other. Then the
(conditional) likelihood function is
$$\prod_{i=1}^{n} \Pr(Y = y_i \mid X = x_i) = \prod_{i=1}^{n} p(x_i; \theta)^{y_i} \, (1 - p(x_i; \theta))^{1 - y_i} \tag{12.1}$$
Recall that in a sequence of Bernoulli trials $y_1, \ldots, y_n$, where there is a constant
probability of success $p$, the likelihood is
$$\prod_{i=1}^{n} p^{y_i} (1 - p)^{1 - y_i} \tag{12.2}$$
As you learned in intro stats, this likelihood is maximized when $p = \hat{p} = n^{-1} \sum_{i=1}^{n} y_i$.
If each trial had its own success probability $p_i$, this likelihood becomes
$$\prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i} \tag{12.3}$$
Without some constraints, estimating the "inhomogeneous Bernoulli" model by maximum
likelihood doesn't work; we'd get $\hat{p}_i = 1$ when $y_i = 1$, $\hat{p}_i = 0$ when $y_i = 0$, and
learn nothing. If on the other hand we assume that the $p_i$ aren't just arbitrary numbers
but are linked together, those constraints give non-trivial parameter estimates,
and let us generalize. In the kind of model we are talking about, the constraint,
$p_i = p(x_i; \theta)$, tells us that $p_i$ must be the same whenever $x_i$ is the same, and if $p$ is a
continuous function, then similar values of $x_i$ must lead to similar values of $p_i$. Assuming
$p$ is known (up to parameters), the likelihood is a function of $\theta$, and we can
estimate $\theta$ by maximizing the likelihood. This lecture will be about this approach.
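The two Bernoulli cases above can be checked numerically. This is an illustrative sketch on a made-up data vector: it maximizes the log of the constant-probability likelihood over a grid and compares the result with the sample mean, then evaluates the unconstrained "inhomogeneous" likelihood at the degenerate estimates where each probability equals its own observation. The helper name `bernoulli_loglik` is hypothetical.

```python
import numpy as np

def bernoulli_loglik(p, y):
    """Log of the constant-p likelihood: sum_i [y_i log p + (1 - y_i) log(1 - p)]."""
    y = np.asarray(y)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 0, 1, 1, 0, 1, 1])      # made-up outcomes

# Grid search over candidate values of p, avoiding 0 and 1 where the log diverges.
grid = np.linspace(0.001, 0.999, 999)
logliks = np.array([bernoulli_loglik(p, y) for p in grid])
p_hat_grid = grid[np.argmax(logliks)]
p_hat_formula = y.mean()                    # the closed-form maximizer

# Unconstrained inhomogeneous model: setting p_i = y_i makes every factor equal
# to 1 (NumPy evaluates 0.0 ** 0 as 1.0), so the likelihood reaches its maximum
# possible value, but the fit only restates the data.
p_sat = y.astype(float)
sat_lik = np.prod(p_sat ** y * (1 - p_sat) ** (1 - y))

print(p_hat_grid, p_hat_formula, sat_lik)
```

The grid maximizer agrees with the sample mean up to the grid spacing, while the saturated fit attains a likelihood of exactly 1 and says nothing about a new trial. That is why some constraint linking the probabilities is needed.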
12.2 Logistic Regression
To sum up: we have a binary output variable Y , and we want to model the condi-
tional probability Pr (Y = 1|X = x) as a function of x; any unknown parameters in
the function are to be estimated by maximum likelihood. By now, it will not surprise
you to learn that statisticians have approached this problem by asking themselves "how
can we use linear regression to solve this?"
1. The most obvious idea is to let p(x) be a linear function of x. Every increment
of a component of x would add or subtract so much to the probability. The
conceptual problem here is that p must be between 0 and 1, and linear func-
tions are unbounded. Moreover, in many situations we empirically see "dimin-
ishing returns" -- changing p by the same amount requires a bigger change in
x when p is already large (or small) than when p is close to 1/2. Linear models
can't do this.
2. The next most obvious idea is to let log p(x) be a linear function of x, so that
changing an input variable multiplies the probability by a fixed amount. The
problem is that logarithms are unbounded in only one direction, and linear
functions are not.
3. Finally, the easiest modification of log p which has an unbounded range is the
logistic (or logit) transformation, $\log \frac{p}{1-p}$. We can make this a linear function
of x without fear of nonsensical results. (Of course the results could still
happen to be wrong, but they're not guaranteed to be wrong.)
This last alternative is logistic regression.
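The ranges of the three candidate transformations in the list above can be compared directly. The sketch below uses a handful of arbitrary probabilities, and also illustrates the "diminishing returns" point by comparing how much the probability moves for the same step on the logit scale near 1/2 and far from it. The function name `inv_logit` is just a label chosen here.

```python
import numpy as np

p = np.array([0.001, 0.01, 0.5, 0.99, 0.999])   # arbitrary probabilities

identity = p                       # option 1: confined to (0, 1)
log_p = np.log(p)                  # option 2: always negative, bounded above by 0
logit_p = np.log(p / (1 - p))      # option 3: unbounded in both directions

def inv_logit(z):
    """Map a value on the logit scale back to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# The same unit step on the logit scale changes p a lot near p = 1/2
# and very little once p is already close to 1.
step_mid = inv_logit(1.0) - inv_logit(0.0)
step_tail = inv_logit(5.0) - inv_logit(4.0)

print(log_p)
print(logit_p)
print(step_mid, step_tail)
```

Only the logit takes both large negative and large positive values, so only it can be matched to an unbounded linear function without restriction.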
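Formally, the logistic regression model is that
$$\log \frac{p(x)}{1 - p(x)} = \beta_0 + x \cdot \beta \tag{12.4}$$
Solving for $p$, this gives
$$p(x; \beta_0, \beta) = \frac{e^{\beta_0 + x \cdot \beta}}{1 + e^{\beta_0 + x \cdot \beta}} = \frac{1}{1 + e^{-(\beta_0 + x \cdot \beta)}} \tag{12.5}$$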
Notice that the overall specification is a lot easier to grasp in terms of the transformed
probability than in terms of the untransformed probability.[1]
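For readers who want the intermediate algebra behind "solving for p", the steps can be written out as follows. This is a routine derivation, with $\beta_0$ the intercept, $\beta$ the coefficient vector, and $p$ short for $p(x)$.

```latex
\begin{align*}
\log \frac{p}{1 - p} &= \beta_0 + x \cdot \beta \\
\frac{p}{1 - p} &= e^{\beta_0 + x \cdot \beta} \\
p &= (1 - p)\, e^{\beta_0 + x \cdot \beta} \\
p \left( 1 + e^{\beta_0 + x \cdot \beta} \right) &= e^{\beta_0 + x \cdot \beta} \\
p &= \frac{e^{\beta_0 + x \cdot \beta}}{1 + e^{\beta_0 + x \cdot \beta}}
   = \frac{1}{1 + e^{-(\beta_0 + x \cdot \beta)}}
\end{align*}
```

The last equality comes from dividing numerator and denominator by $e^{\beta_0 + x \cdot \beta}$.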
To minimize the misclassification rate, we should predict $Y = 1$ when $p \geq 0.5$
and $Y = 0$ when $p < 0.5$. This means guessing 1 whenever $\beta_0 + x \cdot \beta$ is non-negative,
and 0 otherwise. So logistic regression gives us a linear classifier. The decision
boundary separating the two predicted classes is the solution of $\beta_0 + x \cdot \beta = 0$,
which is a point if x is one dimensional, a line if it is two dimensional, etc. One can
show (exercise!) that the distance from the decision boundary is $\beta_0/\|\beta\| + x \cdot \beta/\|\beta\|$.
Logistic regression not only says where the boundary between the classes is, but also
says (via Eq. 12.5) that the class probabilities depend on distance from the boundary,
in a particular way, and that they go towards the extremes (0 and 1) more rapidly
when $\|\beta\|$ is larger. It's these statements about probabilities which make logistic
regression more than just a classifier. It makes stronger, more detailed predictions,
and can be fit in a different way; but those strong predictions could be wrong.
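The geometry described in this paragraph can be sketched in a few lines. The coefficients and the four input points below are hypothetical, picked so that the arithmetic is easy to follow (the coefficient vector has norm 5), and the function names are only labels for this example.

```python
import numpy as np

def prob(X, beta0, beta):
    """Pr(Y = 1 | X = x) under the logistic model, one value per row of X."""
    return 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))

def signed_distance(X, beta0, beta):
    """Signed distance of each row of X from the hyperplane beta0 + x . beta = 0."""
    return (beta0 + X @ beta) / np.linalg.norm(beta)

beta0, beta = -1.0, np.array([3.0, 4.0])     # hypothetical coefficients, ||beta|| = 5
X = np.array([[0.0, 0.25],                   # lies exactly on the boundary
              [1.0, 1.0],
              [-1.0, 0.0],
              [0.2, 0.2]])

p = prob(X, beta0, beta)
y_hat = (p >= 0.5).astype(int)               # predict 1 when p >= 0.5
d = signed_distance(X, beta0, beta)

# Multiplying (beta0, beta) by 10 leaves the boundary, the distances, and the
# predicted classes unchanged, but pushes the probabilities towards 0 and 1.
p10 = prob(X, 10 * beta0, 10 * beta)
y_hat10 = (p10 >= 0.5).astype(int)
d10 = signed_distance(X, 10 * beta0, 10 * beta)

print(y_hat, d)
print(p)
print(p10)
```

Two coefficient vectors that differ only in scale give the same classifier but different probability statements, which is the sense in which logistic regression says more than a bare classification rule.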
Using logistic regression to predict class probabilities is a modeling choice, just
like it's a modeling choice to predict quantitative variables with linear regression.
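Since the parameters are to be estimated by maximum likelihood, it may help to see one way of doing that on simulated data. The sketch below maximizes the conditional log-likelihood with Newton's method. Newton's method is a choice made for this example, not something the text above prescribes, and the function name `fit_logistic`, the true coefficients, the sample size, and the seed are all hypothetical.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximize the conditional log-likelihood by Newton's method.

    X is an (n, d) array of inputs, y an (n,) array of 0/1 responses.
    Returns the estimated intercept and coefficient vector.
    """
    A = np.column_stack([np.ones(len(y)), X])       # column of 1s for the intercept
    theta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(A @ theta)))      # current fitted probabilities
        grad = A.T @ (y - p)                        # gradient of the log-likelihood
        hess = A.T @ (A * (p * (1 - p))[:, None])   # negative Hessian
        theta = theta + np.linalg.solve(hess, grad) # Newton step
    return theta[0], theta[1:]

# Simulate from a logistic model with known (made-up) coefficients.
rng = np.random.default_rng(1)
n = 5000
beta0_true, beta_true = -0.5, np.array([1.5, -1.0])
X = rng.normal(size=(n, 2))
p_true = 1.0 / (1.0 + np.exp(-(beta0_true + X @ beta_true)))
y = rng.binomial(1, p_true)

beta0_hat, beta_hat = fit_logistic(X, y)

# At the maximum, the gradient of the log-likelihood should be essentially zero.
A = np.column_stack([np.ones(n), X])
p_hat = 1.0 / (1.0 + np.exp(-(beta0_hat + X @ beta_hat)))
score = A.T @ (y - p_hat)

print(beta0_hat, beta_hat)
```

This fixed-iteration loop has no step-size control or convergence check, and it would fail on perfectly separable data, where the likelihood has no finite maximizer. A production fit should use an established library routine instead.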
[1] Unless you've taken statistical mechanics, in which case you recognize that this is the Boltzmann
distribution for a system with two states, which differ in energy by $\beta_0 + x \cdot \beta$.
Author: Cosma Shalizi
CreationDate: Tue Feb 28 01:22:18 2012