
Data Analysis Toolkit #10: Simple linear regression Page 1

Simple linear regression is the most commonly used technique for determining how one variable of interest (the

response variable) is affected by changes in another variable (the explanatory variable). The terms "response" and

"explanatory" mean the same thing as "dependent" and "independent", but the former terminology is preferred because

the "independent" variable may actually be interdependent with many other variables as well.

Simple linear regression is used for three main purposes:

1. To describe the linear dependence of one variable on another

2. To predict values of one variable from values of another, for which more data are available

3. To correct for the linear dependence of one variable on another, in order to clarify other features of its variability.

Any line fitted through a cloud of data will deviate from each data point to a greater or lesser degree. The vertical

distance between a data point and the fitted line is termed a "residual". This distance is a measure of prediction error, in

the sense that it is the discrepancy between the actual value of the response variable and the value predicted by the line.

Linear regression determines the best-fit line through a scatterplot of data, such that the sum of squared residuals is

minimized; equivalently, it minimizes the error variance. The fit is "best" in precisely that sense: the sum of squared

errors is as small as possible. That is why it is also termed "Ordinary Least Squares" regression.
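To make the least-squares criterion concrete, here is a minimal sketch in plain Python (the five-point dataset is made up for illustration): it fits the line and then checks that perturbing either coefficient can only increase the sum of squared residuals.

```python
# Sketch: the least-squares line minimizes the sum of squared residuals.
# Small illustrative dataset (not from the handout).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

# Least-squares slope and intercept
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
    sum((x - x_bar) ** 2 for x in X)
a = y_bar - b * x_bar

def sse(a0, b0):
    """Sum of squared vertical residuals for the line Y = a0 + b0*X."""
    return sum((y - (a0 + b0 * x)) ** 2 for x, y in zip(X, Y))

# Perturbing either parameter can only increase the sum of squared errors.
best = sse(a, b)
others = [sse(a + da, b + db)
          for da in (-0.1, 0.0, 0.1) for db in (-0.1, 0.0, 0.1)
          if (da, db) != (0.0, 0.0)]
```

Every entry of `others` exceeds `best`, which is exactly what "best fit" means here.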

Derivation of linear regression equations

The mathematical problem is straightforward: given a set of n points (X_i, Y_i) on a scatterplot, find the best-fit line, \hat{Y}_i = a + bX_i, such that the sum of squared errors in Y, \sum (Y_i - \hat{Y}_i)^2, is minimized.

The derivation proceeds as follows. For convenience, name the sum of squares "Q":

Q = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - a - bX_i)^2    (1)

Then, Q will be minimized at the values of a and b for which \partial Q / \partial a = 0 and \partial Q / \partial b = 0. The first of these conditions is,

\frac{\partial Q}{\partial a} = \sum_{i=1}^{n} -2(Y_i - a - bX_i) = 2\left[ na + b\sum_{i=1}^{n} X_i - \sum_{i=1}^{n} Y_i \right] = 0    (2)

which, if we divide through by 2 and solve for a, becomes simply,

a = \bar{Y} - b\bar{X}    (3)

which says that the constant a (the y-intercept) is set such that the line must go through the mean of x and y. This

makes sense, because this point is the "center" of the data cloud. The second condition for minimizing Q is,

\frac{\partial Q}{\partial b} = \sum_{i=1}^{n} -2X_i(Y_i - a - bX_i) = -2\sum_{i=1}^{n}\left( X_iY_i - aX_i - bX_i^2 \right) = 0    (4)

If we substitute the expression for a from (3) into (4), then we get,

\sum_{i=1}^{n}\left( X_iY_i - X_i\bar{Y} + bX_i\bar{X} - bX_i^2 \right) = 0    (5)

We can separate this into two sums,

\sum_{i=1}^{n}\left( X_iY_i - X_i\bar{Y} \right) - b\sum_{i=1}^{n}\left( X_i^2 - X_i\bar{X} \right) = 0    (6)

which becomes directly,

Copyright © 1996, 2001 Prof. James Kirchner


b = \frac{\sum_{i=1}^{n}\left( X_iY_i - X_i\bar{Y} \right)}{\sum_{i=1}^{n}\left( X_i^2 - X_i\bar{X} \right)} = \frac{\sum_{i=1}^{n} X_iY_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}    (7)

We can translate (7) into a more intuitively obvious form, by noting that

\sum_{i=1}^{n}\left( \bar{X}^2 - X_i\bar{X} \right) = 0 \quad\text{and}\quad \sum_{i=1}^{n}\left( \bar{X}\bar{Y} - Y_i\bar{X} \right) = 0    (8)

so that b can be rewritten as the ratio of Cov(x,y) to Var(x):

b = \frac{\sum_{i=1}^{n}\left( X_iY_i - X_i\bar{Y} \right) + \sum_{i=1}^{n}\left( \bar{X}\bar{Y} - Y_i\bar{X} \right)}{\sum_{i=1}^{n}\left( X_i^2 - X_i\bar{X} \right) + \sum_{i=1}^{n}\left( \bar{X}^2 - X_i\bar{X} \right)} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}\left( X_i - \bar{X} \right)\left( Y_i - \bar{Y} \right)}{\frac{1}{n-1}\sum_{i=1}^{n}\left( X_i - \bar{X} \right)^2} = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}    (9)
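Equations (3), (7), and (9) translate directly into code. The sketch below, in plain Python with a small made-up dataset, computes the slope by the computational form (7) and again as Cov(X,Y)/Var(X), then the intercept from (3).

```python
# Slope and intercept by least squares, on a small illustrative dataset.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

# Equation (7): computational form of the slope
b = (sum(x * y for x, y in zip(X, Y)) - n * x_bar * y_bar) / \
    (sum(x * x for x in X) - n * x_bar ** 2)

# Equation (3): the fitted line passes through (x_bar, y_bar)
a = y_bar - b * x_bar

# Equation (9): the same slope as Cov(X,Y) / Var(X), both with n-1 divisors
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / (n - 1)
var_x = sum((x - x_bar) ** 2 for x in X) / (n - 1)
b_alt = cov_xy / var_x
```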

The quantities that result from regression analyses can be written in many different forms that are mathematically

equivalent but superficially distinct. All of the following forms of the regression slope b are mathematically equivalent:

b = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} \;\text{or}\; \frac{\sum xy}{\sum x^2} \;\text{or}\; \frac{\sum_{i=1}^{n} X_iY_i - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2} \;\text{or}\; \frac{\sum_{i=1}^{n} X_iY_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} \;\text{or}\; \frac{\frac{1}{n}\sum_{i=1}^{n} X_iY_i - \bar{X}\bar{Y}}{\frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2} \;\text{or}\; \frac{\overline{XY} - \bar{X}\,\bar{Y}}{\overline{X^2} - \bar{X}^2}    (10)

A common notational shorthand is to write the "sum of squares of X" (that is, the sum of squared deviations of the X's

from their mean), the "sum of squares of Y", and the "sum of XY cross products" as,

\sum x^2 = SS_x = (n-1)\,\mathrm{Var}(X) = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2    (11)

\sum y^2 = SS_y = (n-1)\,\mathrm{Var}(Y) = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2    (12)

\sum xy = S_{xy} = (n-1)\,\mathrm{Cov}(X,Y) = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_iY_i - n\bar{X}\bar{Y}    (13)
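A quick check of (11)-(13) in plain Python, on a small made-up dataset: each sum of squares or cross products comes out the same whether computed from deviations or by the computational shortcut.

```python
# Equations (11)-(13): each sum of squares two equivalent ways.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

SSx = sum((x - x_bar) ** 2 for x in X)                 # deviation form
SSx_short = sum(x * x for x in X) - n * x_bar ** 2     # computational shortcut
SSy = sum((y - y_bar) ** 2 for y in Y)
SSy_short = sum(y * y for y in Y) - n * y_bar ** 2
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
Sxy_short = sum(x * y for x, y in zip(X, Y)) - n * x_bar * y_bar
```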

It is important to recognize that \sum x^2, \sum y^2, and \sum xy, as used in Zar and in equations (10)-(13), are shorthand: the lowercase x and y denote deviations from the means (x = X_i - \bar{X}, y = Y_i - \bar{Y}), so these are symbols for the sums of squares and cross products of deviations, not sums over the raw variables. Note also that the S and SS in (11)-(13) are uppercase S's rather than standard deviations.

Besides the regression slope b and intercept a, the third parameter of fundamental importance is the correlation coefficient r or the coefficient of determination r². r² is the ratio between the variance in Y that is "explained" by the regression (or, equivalently, the variance in \hat{Y}) and the total variance in Y. Like b, r² can be calculated many different ways:

r^2 = \frac{\mathrm{Var}(\hat{Y})}{\mathrm{Var}(Y)} = \frac{b^2\,\mathrm{Var}(X)}{\mathrm{Var}(Y)} = \frac{\left(\mathrm{Cov}(X,Y)\right)^2}{\mathrm{Var}(X)\,\mathrm{Var}(Y)} = \frac{\mathrm{Var}(Y) - \mathrm{Var}(Y - \hat{Y})}{\mathrm{Var}(Y)} = \frac{S_{xy}^2}{SS_x\,SS_y}    (14)
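Here is a sketch verifying several of the forms in (14) numerically, in plain Python on a small made-up dataset; all of them should agree.

```python
# r^2 computed several equivalent ways (equation (14)).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

SSx = sum((x - x_bar) ** 2 for x in X)
SSy = sum((y - y_bar) ** 2 for y in Y)
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
b = Sxy / SSx
a = y_bar - b * x_bar

var_x, var_y = SSx / (n - 1), SSy / (n - 1)
cov_xy = Sxy / (n - 1)
resid = [y - (a + b * x) for x, y in zip(X, Y)]
var_resid = sum(e ** 2 for e in resid) / (n - 1)   # residuals have mean 0

r2_a = Sxy ** 2 / (SSx * SSy)            # S_xy^2 / (SS_x SS_y)
r2_b = b ** 2 * var_x / var_y            # b^2 Var(X) / Var(Y)
r2_c = cov_xy ** 2 / (var_x * var_y)     # Cov^2 / (Var(X) Var(Y))
r2_d = (var_y - var_resid) / var_y       # explained share of Var(Y)
```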


Equation (14) implies the following relationship between the correlation coefficient r, the regression slope b, and the standard deviations of X and Y (s_X and s_Y):

r = b\,\frac{s_X}{s_Y} \quad\text{and}\quad b = r\,\frac{s_Y}{s_X}    (15)

The residuals e_i are the deviations of each response value Y_i from its estimate \hat{Y}_i. These residuals can be summed in the sum of squared errors (SSE). The mean square error (MSE) is just what the name implies, and can also be considered the "error variance" (s^2_{Y\cdot X}). The root-mean-square error (RMSE), also termed the "standard error of the regression" (s_{Y\cdot X}), is the standard deviation of the residuals. The mean square error and RMSE are calculated by dividing by n-2, because linear regression removes two degrees of freedom from the data (by estimating two parameters, a and b):

e_i = Y_i - \hat{Y}_i \qquad SSE = \sum_{i=1}^{n} e_i^2 \qquad MSE = s_{Y\cdot X}^2 = \frac{SSE}{n-2} = \mathrm{Var}(Y)\,(1 - r^2)\,\frac{n-1}{n-2} \qquad RMSE = s_{Y\cdot X} = \sqrt{\frac{SSE}{n-2}} = s_Y\sqrt{(1 - r^2)\,\frac{n-1}{n-2}}    (16)

where Var(Y) is the sample, not population, variance of Y, and the factors of (n-1)/(n-2) serve only to correct for the change in the number of degrees of freedom between the calculation of the variance (d.f. = n-1) and s_{Y\cdot X} (d.f. = n-2).
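The calculation in (16) can be sketched as follows (plain Python, small made-up dataset); the cross-check at the end confirms the Var(Y)(1-r²)(n-1)/(n-2) form.

```python
import math

# Equation (16): SSE, MSE, and RMSE with n-2 degrees of freedom.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
SSx = sum((x - x_bar) ** 2 for x in X)
SSy = sum((y - y_bar) ** 2 for y in Y)
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
b = Sxy / SSx
a = y_bar - b * x_bar

resid = [y - (a + b * x) for x, y in zip(X, Y)]
SSE = sum(e ** 2 for e in resid)
MSE = SSE / (n - 2)            # error variance s^2_{Y.X}
RMSE = math.sqrt(MSE)          # standard error of the regression

# Cross-check against the Var(Y)(1 - r^2)(n-1)/(n-2) form
r2 = Sxy ** 2 / (SSx * SSy)
var_y = SSy / (n - 1)
MSE_alt = var_y * (1 - r2) * (n - 1) / (n - 2)
```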

Uncertainty in regression parameters

The standard error of the regression slope b can be expressed many different ways, including:

s_b = \sqrt{\frac{SS_y/SS_x - b^2}{n-2}} = \frac{s_{Y\cdot X}}{\sqrt{SS_x}} = \frac{1}{\sqrt{n}}\,\frac{s_{Y\cdot X}}{s_X} = \frac{s_Y}{s_X}\sqrt{\frac{1-r^2}{n-2}} = \frac{b}{r}\sqrt{\frac{1-r^2}{n-2}} = \frac{b}{\sqrt{n-2}}\sqrt{\frac{1}{r^2} - 1}    (17)

(in the third form, s_X is the population standard deviation of X, computed with n rather than n-1; in the fourth form, s_X and s_Y are the usual sample standard deviations)
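A numerical check of several of the forms in (17), in plain Python on a small made-up dataset; all four should agree.

```python
import math

# Equation (17): four equivalent forms of the slope's standard error.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
SSx = sum((x - x_bar) ** 2 for x in X)
SSy = sum((y - y_bar) ** 2 for y in Y)
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
b = Sxy / SSx
r = Sxy / math.sqrt(SSx * SSy)
s_x, s_y = math.sqrt(SSx / (n - 1)), math.sqrt(SSy / (n - 1))  # sample SDs
s_yx = math.sqrt((SSy - b * Sxy) / (n - 2))  # standard error of regression

sb_1 = math.sqrt((SSy / SSx - b ** 2) / (n - 2))
sb_2 = s_yx / math.sqrt(SSx)
sb_3 = (s_y / s_x) * math.sqrt((1 - r ** 2) / (n - 2))
sb_4 = (b / r) * math.sqrt((1 - r ** 2) / (n - 2))
```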

If all of the assumptions underlying linear regression are true (see below), the regression slope b will be approximately

t-distributed. Therefore, confidence intervals for b can be calculated as,

CI = b \pm t_{\alpha(2),\,n-2}\,s_b    (18)

To determine whether the slope of the regression line is statistically significant, one can straightforwardly calculate t,

the number of standard errors that b differs from a slope of zero:

t = \frac{b}{s_b} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}    (19)
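Equations (18) and (19) can be sketched as follows (plain Python, small made-up five-point dataset). The critical value 3.182 is the two-tailed t_{0.05(2),3} for 3 degrees of freedom, taken from a standard t-table.

```python
import math

# Equations (18)-(19): t statistic and confidence interval for the slope.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
SSx = sum((x - x_bar) ** 2 for x in X)
SSy = sum((y - y_bar) ** 2 for y in Y)
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
b = Sxy / SSx
r = Sxy / math.sqrt(SSx * SSy)
s_yx = math.sqrt((SSy - b * Sxy) / (n - 2))
s_b = s_yx / math.sqrt(SSx)

# Equation (19): the t statistic two equivalent ways
t1 = b / s_b
t2 = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Equation (18): 95% CI, using t_{0.05(2),3} ~ 3.182 from a t-table
t_crit = 3.182
ci = (b - t_crit * s_b, b + t_crit * s_b)
```

For this tiny sample the interval includes zero (t1 < t_crit), so the slope would not be judged significant at α = 0.05.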

and then use the t-table to evaluate the probability associated with this value of t (and n-2 degrees of freedom). The uncertainty in the elevation of the regression line at the mean X (that is, the uncertainty in \hat{Y} at the mean X) is simply the standard error of the regression, s_{Y\cdot X}, divided by the square root of n. Thus the standard error in the predicted value of \hat{Y}_i for some X_i is the uncertainty in the elevation at the mean X, plus the uncertainty in b times the distance from the mean X to X_i, added in quadrature:

s_{\hat{Y}_i} = \sqrt{\left(\frac{s_{Y\cdot X}}{\sqrt{n}}\right)^2 + \left(s_b\,(X_i - \bar{X})\right)^2} = s_{Y\cdot X}\sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{SS_x}} = \frac{s_{Y\cdot X}}{\sqrt{n}}\sqrt{1 + \frac{(X_i - \bar{X})^2}{\mathrm{Var}(X)}}    (20)

where Var(X) is the population (not sample) variance of X (that is, it is calculated with n rather than n-1). \hat{Y}_i is also t-distributed, so a confidence interval for \hat{Y}_i can be estimated by multiplying the standard error of \hat{Y}_i by t_{\alpha(2),\,n-2}. Note

that this confidence interval grows as Xi moves farther and farther from the mean of X. Extrapolation beyond the range

of the data assumes that the underlying relationship continues to be linear beyond that range. Equation (20) gives the

standard error of the \hat{Y}_i, that is, of the Y-values predicted by the regression line. The uncertainty in a new individual value of Y (that is, the prediction interval rather than the confidence interval) depends not only on the uncertainty in where the regression line is, but also on the uncertainty in where the individual data point Y lies in relation to the regression line. This latter uncertainty is simply the standard deviation of the residuals, s_{Y\cdot X}, which is added (in quadrature) to the uncertainty in \hat{Y}_i, as follows:

s_{Y_i(\text{new})} = \sqrt{s_{Y\cdot X}^2 + s_{\hat{Y}_i}^2} = s_{Y\cdot X}\sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SS_x}}    (21)
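Equation (20) and the quadrature rule for a new observation can be sketched as follows (plain Python, small made-up dataset). The helper names `se_line` and `se_newy` are invented for illustration.

```python
import math

# Equation (20) and the prediction-interval standard error.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
SSx = sum((x - x_bar) ** 2 for x in X)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / SSx
a = y_bar - b * x_bar
SSE = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
s_yx = math.sqrt(SSE / (n - 2))      # standard error of the regression

def se_line(xi):
    """Equation (20): standard error of the fitted value Y-hat at xi."""
    return s_yx * math.sqrt(1 / n + (xi - x_bar) ** 2 / SSx)

def se_newy(xi):
    """Standard error for a NEW observation at xi: s_yx added in quadrature."""
    return math.sqrt(s_yx ** 2 + se_line(xi) ** 2)
```

The band is narrowest at the mean of X and widens away from it, and the prediction error always exceeds the error of the line itself.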
