Chapter 12

Logistic Regression

12.1 Modeling Conditional Probabilities

So far, we have looked either at estimating the conditional expectations of continuous variables (as in regression), or at estimating distributions. There are many situations, however, where we are interested in input-output relationships, as in regression, but the output variable is discrete rather than continuous. In particular, there are many situations where we have binary outcomes (it snows in Pittsburgh on a given day, or it doesn't; this squirrel carries plague, or it doesn't; this loan will be paid back, or it won't; this person will get heart disease in the next five years, or they won't). In addition to the binary outcome, we have some input variables, which may or may not be continuous. How could we model and analyze such data?

We could try to come up with a rule which guesses the binary output from the

input variables. This is called classification, and is an important topic in statistics

and machine learning. However, simply guessing "yes" or "no" is pretty crude --

especially if there is no perfect rule. (Why should there be?) Something which takes

noise into account, and doesn't just give a binary answer, will often be useful. In

short, we want probabilities -- which means we need to fit a stochastic model.

What would be nice, in fact, would be to have the conditional distribution of the response $Y$, given the input variables, $\Pr(Y \mid X)$. This would tell us about how precise our predictions are. If our model says that there's a 51% chance of snow and it doesn't snow, that's better than if it had said there was a 99% chance of snow (though even a 99% chance is not a sure thing). We have seen how to estimate conditional probabilities non-parametrically, and could do this using the kernels for discrete variables from lecture 6. While there are a lot of merits to this approach, it does involve coming up with a model for the joint distribution of outputs $Y$ and inputs $X$, which can be quite time-consuming.

Let's pick one of the classes and call it "1" and the other "0". (It doesn't matter which is which.) Then $Y$ becomes an indicator variable, and you can convince yourself that $\Pr(Y = 1) = E[Y]$. Similarly, $\Pr(Y = 1 \mid X = x) = E[Y \mid X = x]$. (In a phrase, "conditional probability is the conditional expectation of the indicator".)

This helps us because by this point we know all about estimating conditional expectations. The most straightforward thing for us to do at this point would be to pick out our favorite smoother and estimate the regression function for the indicator variable; this will be an estimate of the conditional probability function.
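The identity that conditional probability is the conditional expectation of the indicator takes one line to check. Because $Y$ takes only the values 0 and 1, each expectation is a two-term sum:

```latex
\begin{align*}
E[Y] &= 1 \cdot \Pr(Y = 1) + 0 \cdot \Pr(Y = 0) = \Pr(Y = 1), \\
E[Y \mid X = x] &= 1 \cdot \Pr(Y = 1 \mid X = x) + 0 \cdot \Pr(Y = 0 \mid X = x)
                 = \Pr(Y = 1 \mid X = x).
\end{align*}
```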

There are two reasons not to just plunge ahead with that idea. One is that probabilities must be between 0 and 1, but our smoothers will not necessarily respect that, even if all the observed $y_i$ they get are either 0 or 1. The other is that we might be better off making more use of the fact that we are trying to estimate probabilities, by more explicitly modeling the probability.
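To make the first objection concrete, here is a minimal sketch in plain Python. It uses about the simplest smoother available, a least-squares straight line, on a small made-up data set; the inputs, the outcomes, and the helper name `fit_line` are illustrative assumptions, not taken from the text. Every observed outcome is 0 or 1, yet the fitted values escape the unit interval at both ends.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up data: the outcome is 0 for x <= 0 and 1 for x > 0.
xs = list(range(-5, 6))
ys = [1 if x > 0 else 0 for x in xs]

intercept, slope = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]

print(f"smallest fitted value: {min(fitted):.3f}")  # below 0
print(f"largest fitted value:  {max(fitted):.3f}")  # above 1
```

A more flexible smoother would bend to follow the data better, but nothing in a generic regression estimate forces its output to stay inside $[0, 1]$.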

Assume that $\Pr(Y = 1 \mid X = x) = p(x; \theta)$, for some function $p$ parameterized by $\theta$, and further assume that observations are independent of each other. Then the (conditional) likelihood function is

$$\prod_{i=1}^{n} \Pr(Y = y_i \mid X = x_i) = \prod_{i=1}^{n} p(x_i; \theta)^{y_i} \left(1 - p(x_i; \theta)\right)^{1 - y_i} \tag{12.1}$$

Recall that in a sequence of Bernoulli trials $y_1, \ldots, y_n$, where there is a constant probability of success $p$, the likelihood is

$$\prod_{i=1}^{n} p^{y_i} (1 - p)^{1 - y_i} \tag{12.2}$$

As you learned in intro. stats, this likelihood is maximized when $p = \hat{p} = n^{-1} \sum_{i=1}^{n} y_i$. If each trial had its own success probability $p_i$, this likelihood becomes

$$\prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i} \tag{12.3}$$

Without some constraints, estimating the "inhomogeneous Bernoulli" model by maximum likelihood doesn't work; we'd get $\hat{p}_i = 1$ when $y_i = 1$, $\hat{p}_i = 0$ when $y_i = 0$, and learn nothing. If on the other hand we assume that the $p_i$ aren't just arbitrary numbers but are linked together, those constraints give non-trivial parameter estimates, and let us generalize. In the kind of model we are talking about, the constraint, $p_i = p(x_i; \theta)$, tells us that $p_i$ must be the same whenever $x_i$ is the same, and if $p$ is a continuous function, then similar values of $x_i$ must lead to similar values of $p_i$. Assuming $p$ is known (up to parameters), the likelihood is a function of $\theta$, and we can estimate $\theta$ by maximizing the likelihood. This lecture will be about this approach.
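The two Bernoulli cases can be checked numerically. The following is a sketch in plain Python; the eight outcomes and the helper names are made up for illustration. A grid search over a single shared $p$ recovers the sample mean, while giving each trial its own $p_i = y_i$ pushes the likelihood to its ceiling of 1 without saying anything about a new trial.

```python
import math

def log_likelihood(p, ys):
    """Log of Eq. 12.2: constant success probability p, binary outcomes ys."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y in ys)

def likelihood_per_trial(ps, ys):
    """Eq. 12.3: trial i has its own success probability ps[i]."""
    result = 1.0
    for p, y in zip(ps, ys):
        result *= p ** y * (1 - p) ** (1 - y)
    return result

ys = [1, 0, 1, 1, 0, 1, 0, 1]           # made-up outcomes, 5 successes in 8
sample_mean = sum(ys) / len(ys)

# Search a grid of candidate values strictly inside (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
best_p = max(grid, key=lambda p: log_likelihood(p, ys))
print(best_p, sample_mean)               # both 0.625

# Unconstrained model: setting p_i = y_i fits the sample perfectly.
saturated = likelihood_per_trial([float(y) for y in ys], ys)
print(saturated)                         # 1.0
```

The saturated fit has as many parameters as observations, which is why it cannot generalize; tying the $p_i$ together through a function of $x_i$ is what makes estimation informative.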

12.2 Logistic Regression

To sum up: we have a binary output variable $Y$, and we want to model the conditional probability $\Pr(Y = 1 \mid X = x)$ as a function of $x$; any unknown parameters in the function are to be estimated by maximum likelihood. By now, it will not surprise you to learn that statisticians have approached this problem by asking themselves "how can we use linear regression to solve this?"

1. The most obvious idea is to let $p(x)$ be a linear function of $x$. Every increment of a component of $x$ would add or subtract so much to the probability. The conceptual problem here is that $p$ must be between 0 and 1, and linear functions are unbounded. Moreover, in many situations we empirically see "diminishing returns" -- changing $p$ by the same amount requires a bigger change in $x$ when $p$ is already large (or small) than when $p$ is close to 1/2. Linear models can't do this.

2. The next most obvious idea is to let $\log p(x)$ be a linear function of $x$, so that changing an input variable multiplies the probability by a fixed amount. The problem is that logarithms are unbounded in only one direction, and linear functions are not.

3. Finally, the easiest modification of $\log p$ which has an unbounded range is the logistic (or logit) transformation, $\log \frac{p}{1 - p}$. We can make this a linear function of $x$ without fear of nonsensical results. (Of course the results could still happen to be wrong, but they're not guaranteed to be wrong.)

This last alternative is logistic regression.
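The three candidate scales in the list can be compared directly. The sketch below is plain Python with hypothetical helper names `logit` and `inv_logit`. It evaluates each scale at a few probabilities: $p$ stays in $(0, 1)$, $\log p$ is never positive, and only the logit takes both large negative and large positive values, so setting it equal to a linear function cannot produce an impossible probability.

```python
import math

def logit(p):
    """Log odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse of logit: maps a real number z back into (0, 1)."""
    return 1 / (1 + math.exp(-z))

for p in (0.001, 0.1, 0.5, 0.9, 0.999):
    print(f"p={p:<6} log p={math.log(p):8.3f}  logit p={logit(p):8.3f}")

# Each value of a linear predictor corresponds to a valid probability.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    assert 0 < inv_logit(z) < 1
```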

Formally, the logistic regression model is that

$$\log \frac{p(x)}{1 - p(x)} = \beta_0 + x \cdot \beta \tag{12.4}$$

Solving for $p$, this gives

$$p(x; \beta_0, \beta) = \frac{e^{\beta_0 + x \cdot \beta}}{1 + e^{\beta_0 + x \cdot \beta}} = \frac{1}{1 + e^{-(\beta_0 + x \cdot \beta)}} \tag{12.5}$$

Notice that the over-all specification is a lot easier to grasp in terms of the transformed probability than in terms of the untransformed probability.¹
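For readers who want the intermediate algebra between Eq. 12.4 and Eq. 12.5, here is one way to fill it in. The shorthand $\eta = \beta_0 + x \cdot \beta$ is introduced only for brevity and is not notation from the text.

```latex
\begin{align*}
\log \frac{p}{1 - p} = \eta
  &\;\Longrightarrow\; \frac{p}{1 - p} = e^{\eta} \\
  &\;\Longrightarrow\; p = e^{\eta} - p\, e^{\eta} \\
  &\;\Longrightarrow\; p \left(1 + e^{\eta}\right) = e^{\eta} \\
  &\;\Longrightarrow\; p = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}}
\end{align*}
```

The final equality divides numerator and denominator by $e^{\eta}$.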

To minimize the mis-classification rate, we should predict $Y = 1$ when $p \geq 0.5$ and $Y = 0$ when $p < 0.5$. This means guessing 1 whenever $\beta_0 + x \cdot \beta$ is non-negative, and 0 otherwise. So logistic regression gives us a linear classifier. The decision boundary separating the two predicted classes is the solution of $\beta_0 + x \cdot \beta = 0$, which is a point if $x$ is one dimensional, a line if it is two dimensional, etc. One can show (exercise!) that the signed distance from the decision boundary is $\beta_0 / \|\beta\| + x \cdot \beta / \|\beta\|$. Logistic regression not only says where the boundary between the classes is, but also says (via Eq. 12.5) that the class probabilities depend on distance from the boundary, in a particular way, and that they go towards the extremes (0 and 1) more rapidly when $\|\beta\|$ is larger. It's these statements about probabilities which make logistic regression more than just a classifier. It makes stronger, more detailed predictions, and can be fit in a different way; but those strong predictions could be wrong.
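The geometry in this paragraph can be checked numerically. The sketch below is plain Python with made-up coefficients and hypothetical helper names; none of the numbers come from the text. It evaluates Eq. 12.5, applies the 0.5 threshold, and computes the signed distance to the boundary. It then doubles every coefficient: the boundary and the distance are unchanged, but the probability at the same point moves closer to 1.

```python
import math

def linear_predictor(x, beta0, beta):
    """beta0 + x . beta for a single input vector x."""
    return beta0 + sum(xj * bj for xj, bj in zip(x, beta))

def prob(x, beta0, beta):
    """Eq. 12.5: Pr(Y = 1 | X = x) under the logistic model."""
    return 1 / (1 + math.exp(-linear_predictor(x, beta0, beta)))

def predict(x, beta0, beta):
    """Guess 1 when the probability is at least 0.5, else 0."""
    return 1 if linear_predictor(x, beta0, beta) >= 0 else 0

def signed_distance(x, beta0, beta):
    """Signed distance from x to the boundary beta0 + x . beta = 0."""
    norm = math.sqrt(sum(bj ** 2 for bj in beta))
    return linear_predictor(x, beta0, beta) / norm

beta0, beta = -1.0, [2.0, 1.0]        # made-up coefficients
x = [1.0, 1.0]                        # made-up input point

p_small = prob(x, beta0, beta)
d_small = signed_distance(x, beta0, beta)

# Double every coefficient: same boundary, steeper probabilities.
beta0_big, beta_big = 2 * beta0, [2 * b for b in beta]
p_big = prob(x, beta0_big, beta_big)
d_big = signed_distance(x, beta0_big, beta_big)

print(predict(x, beta0, beta), round(p_small, 3), round(d_small, 3))
print(predict(x, beta0_big, beta_big), round(p_big, 3), round(d_big, 3))
```

Two coefficient vectors that differ only by a positive scale factor give the same classifier but different probability models, which is one way to see that logistic regression asserts more than a decision boundary does.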

Using logistic regression to predict class probabilities is a modeling choice, just

like it's a modeling choice to predict quantitative variables with linear regression.

¹ Unless you've taken statistical mechanics, in which case you recognize that this is the Boltzmann distribution for a system with two states, which differ in energy by $\beta_0 + x \cdot \beta$.

Title: ADAfaEPoV

Author: Cosma Shalizi

CreationDate: Tue Feb 28 01:22:18 2012
