Expanding the Simple Regression Equation - Amherst

Multiple Regression

1) Describe different types of multiple regression models and the methods used to analyze and interpret them.
• Non-linear terms
• Multiple-factor designs

2) Discuss several pitfalls when using regression models.

3) Outline several methods for choosing the 'best-fit model'.
Flavors of Multiple Regression Equations

Non-linear terms
• study time and exam grade
• ice cream and exam grade
• caffeine and exam grade

Multiple Factors
• success in graduate school
b. GPA
c. Undergraduate institution
d. Personal Interview
e. Letters of Recommendation
f. Personal Statement

Non-linear relationships

Expanding the Simple Regression Equation

From this…

$y = \beta_0 + \beta_1 x + \varepsilon$

To this…

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \varepsilon$


$y$: dependent variable

$x_1, x_2, x_3$: independent variables

$\beta_0$: y-intercept

$\beta_1, \beta_2, \beta_3$: parameters relating x’s to y
Least Squares Method

Similar to simple regression in that:
• Minimize the sum of squared distances between individual points and the regression line

But different in that:
• The regression line is no longer required to be a straight line.
• Fit multiple lines to the data
• Estimate many more parameters
o All relevant factors plus $\beta_0$

Bad News:
Computationally demanding

Good News:
SPSS is good at computationally demanding tasks
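
To make the computation concrete, here is a minimal sketch of a least squares fit with two predictors, written in plain Python. The function name `fit_least_squares` and the data are made up for illustration and are not part of the course materials. The sketch solves the normal equations directly, which is fine for a toy example, though a statistics package will use more numerically careful methods.

```python
def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bk] minimizing the sum of squared errors.

    rows: list of predictor tuples (x1, ..., xk), one per observation.
    y:    list of observed responses.
    """
    # Design matrix with a leading 1 in each row for the intercept b0.
    X = [[1.0] + [float(v) for v in r] for r in rows]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; parameters not estimable")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    b = [0.0] * p
    for i in reversed(range(p)):
        s = sum(A[i][j] * b[j] for j in range(i + 1, p))
        b[i] = (A[i][p] - s) / A[i][i]
    return b


# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2 (no error term).
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
b = fit_least_squares(rows, y)
print([round(v, 6) for v in b])
```

Because the toy data contain no random error, the estimates should recover the generating intercept and slopes up to rounding.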
Review of Assumptions/Information regarding $\varepsilon$

1) Normally distributed with a mean of 0 and a variance equal to $\sigma^2$.

2) Random errors are independent.

As with simple regression, we want $\sigma^2$ to be as small as possible. The smaller it is, the better our prediction will be; i.e., the more tightly our data will cluster around our regression line.
• Helps us evaluate utility of model.
• Provides a measure of reliability for our predictions.

$\hat{\sigma}^2 = s^2 = \dfrac{SSE}{N - (\text{number of parameters in model})}$

$s = \sqrt{s^2} = \sqrt{MSE}$ = root MSE
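
As a rough illustration of the estimate of the error variance, the sketch below computes SSE, s² and s from observed and predicted values. The function name `error_variance` and the numbers are hypothetical; the predicted values are invented for the arithmetic and are not the output of an actual fit.

```python
import math


def error_variance(y, y_hat, n_params):
    """Return (SSE, s squared, s) for a model with n_params estimated parameters."""
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    df = len(y) - n_params  # N minus number of parameters (betas, including b0)
    if df <= 0:
        raise ValueError("need more observations than parameters")
    s2 = sse / df           # mean squared error (MSE)
    return sse, s2, math.sqrt(s2)


# Hypothetical observed and predicted values; the model is assumed to have
# 3 parameters (b0, b1, b2).
y = [3.0, 5.0, 7.0, 10.0, 11.0, 14.0]
y_hat = [3.5, 5.0, 7.5, 9.0, 11.5, 13.5]
sse, s2, s = error_variance(y, y_hat, n_params=3)
print(sse, round(s2, 4), round(s, 4))
```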

Analyzing a Multiple Regression Model

(should look familiar to you)

|Step 1 |Hypothesize the deterministic part of the model. |
|Step 2 |Use sample data to estimate unknown parameters ($\beta_0$, $\beta_1$, $\beta_2$). |
|Step 3 |Specify the probability distribution for $\varepsilon$ and estimate the standard deviation. |
|Step 4 |Evaluate the model statistically. |
|Step 5 |Find the best-fit model. |
|Step 6 |Use the model for estimation, prediction, etc. |

Testing Global Usefulness of our Model
Omnibus Test of our Model

Ho: $\beta_1 = \beta_2 = \beta_3 = \dots = \beta_k = 0$

Ha: At least one $\beta \neq 0$.

Test Statistic:

$F = \dfrac{(SS_y - SSE) / k}{SSE / [N - (k+1)]} = \dfrac{R^2 / k}{(1 - R^2) / [N - (k+1)]} = \dfrac{MS_{\text{model}}}{MS_{\text{error}}}$

N = sample size
k = # of terms in model

Rejection Region:
F obs > F critical
Numerator df = k
Denominator df = N – (k+1)
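
The two forms of the test statistic can be checked against each other with a short sketch. The function names and the numbers below are hypothetical. Python's standard library has no F distribution, so the critical value is assumed to come from a table or from SPSS output.

```python
def omnibus_f_from_ss(ss_y, sse, n, k):
    """F from sums of squares: [(SSy - SSE) / k] / [SSE / (N - (k+1))]."""
    df_model = k
    df_error = n - (k + 1)
    f = ((ss_y - sse) / df_model) / (sse / df_error)
    return f, df_model, df_error


def omnibus_f_from_r2(r2, n, k):
    """The same F written in terms of R squared."""
    df_model = k
    df_error = n - (k + 1)
    f = (r2 / df_model) / ((1 - r2) / df_error)
    return f, df_model, df_error


# Hypothetical values: SSy = 100, SSE = 40, N = 23 observations, k = 2 terms.
ss_y, sse, n, k = 100.0, 40.0, 23, 2
f_ss, df_num, df_den = omnibus_f_from_ss(ss_y, sse, n, k)
f_r2, _, _ = omnibus_f_from_r2((ss_y - sse) / ss_y, n, k)
print(f_ss, f_r2, df_num, df_den)
```

Both routes give the same F because R² is just (SSy – SSE) / SSy, so dividing numerator and denominator by SSy leaves the ratio unchanged.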
Interpretation of our Tests

Omnibus Test

If we reject the null, we are 100(1 – $\alpha$)% sure that our model does a better job of predicting y than chance.

Important point: Useful yes!
The best? We’ll return to this.

If your model does not pass the Omnibus, you are skating on thin ice if you try to interpret results of individual parameters.


Tests of Individual Parameters

If we reject the null, we are pretty sure that the independent variable contributes to the variance in the dependent variable.
• + direct relationship
• – inverse relationship
How good is our model?

$R^2$: multiple coefficient of determination
Same idea as with simple models.
Proportion of variance explained by our model.

$R^2 = \dfrac{\text{Variability explained by model}}{\text{Total variability in the sample}} = \dfrac{SS_y - SSE}{SS_y} = \dfrac{SS_{\text{model}}}{SS_{\text{total}}}$
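
A small sketch of the calculation, using made-up observed and predicted values (the function name `r_squared` and the data are illustrative assumptions, and the predictions are not from an actual fit):

```python
def r_squared(y, y_hat):
    """Proportion of total variability explained: (SSy - SSE) / SSy."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)          # SSy
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # SSE
    return (ss_total - sse) / ss_total


# Hypothetical observed values and model predictions.
y = [3.0, 5.0, 7.0, 10.0, 11.0, 14.0]
y_hat = [3.5, 5.0, 7.5, 9.0, 11.5, 13.5]
r2 = r_squared(y, y_hat)
print(round(r2, 3))
```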
A simple 2nd-Order model:
One linear + one quadratic term

Del Monte asks you to determine the relationship between the salt concentration in a can of corn and consumers' preference ratings. Previous research suggests that there is a non-linear relationship between salt concentration and preference, such that above some value, increasing the concentration of salt does not increase subjective ratings.

Interpretation of Parameter Estimates:

$\beta_0$: only meaningful if sample contains data in the range of x=0

$\beta_1$: Is there a straight-line linear relationship between x and y?

$\beta_2$: Is there a higher-order (curvilinear) relationship between x and y?
▪ + concave upward (bowl-shaped)
▪ – concave downward (mound-shaped)

t-test: $t = \dfrac{\hat{\beta}_2 - 0}{s_{\hat{\beta}_2}}$, where $s_{\hat{\beta}_2}$ is the standard error of $\hat{\beta}_2$

df = N – (k+1)
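
The t-test for the quadratic term is simple arithmetic once the estimate and its standard error are read off the output. The sketch below uses invented numbers (they are not the corn results), and the helper name `t_for_parameter` is an assumption for illustration.

```python
def t_for_parameter(estimate, std_err, n, k):
    """t statistic for Ho: beta = 0, with df = N - (k + 1)."""
    t = (estimate - 0) / std_err
    df = n - (k + 1)
    return t, df


# Hypothetical output for the squared term in a model with k = 2 terms
# (x and x squared) fit to N = 20 observations.
t, df = t_for_parameter(estimate=-1.5, std_err=0.5, n=20, k=2)
shape = "concave downward (mound-shaped)" if t < 0 else "concave upward (bowl-shaped)"
print(t, df, shape)
```

The sign of the estimate gives the direction of the curvature; whether it is significant still depends on comparing t with the critical value at the stated df.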
Multiple Regression: Non-linear relationships (corn)




2-Factor model

The owner of a car dealership needs to hire a new salesperson. Ideally, s/he would like to choose an employee whose personality is well-suited to selling cars and will help them sell as many cars as possible. To this end, s/he rates each of her/his salespeople along two dimensions that s/he has isolated as being particularly related to success as a salesperson: friendliness and aggressiveness. In addition to this information, s/he recorded the number of car sales made by each employee during the most recent quarter.

Do these data suggest that there is a significant relationship between the two identified personality traits and car salespersonship?

Interpretation of Parameter Estimates for a 2-Factor Model

$\beta_0$: only meaningful if sample contains data in the range of x=0

$\beta_1$: Is there a straight-line linear relationship between friendliness and number of sales?
▪ + direct relationship
▪ – inverse relationship

$\beta_2$: Is there a straight-line relationship between aggressiveness and number of sales?
▪ + direct relationship
▪ – inverse relationship

Multiple Regression: Multiple predictors (car sales)




Regression Pitfalls

1) Parameter Estimability
• Need to have j + 1 levels of your predictor variable where j = the order of polynomial in your model

2) Multi-Collinearity
• You have a problem if your predictor variables are highly correlated with one another
Criterion Variable: Height
Predictor Variables: Length of thumb, length of index finger

3) Limited range
• You will have trouble finding a relationship if your predictor variables have a limited range.
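
One quick screen for multi-collinearity is to correlate the predictors with each other before fitting. The sketch below does this for the thumb and index finger example, with measurements that are invented for illustration; the function name `pearson_r` is likewise an assumption.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)


# Hypothetical measurements (cm) for five people.
thumb = [5.8, 6.1, 6.4, 6.9, 7.2]
index_finger = [6.9, 7.3, 7.5, 8.1, 8.5]
r = pearson_r(thumb, index_finger)
print(round(r, 3))
```

A correlation this close to 1 means the two predictors carry nearly the same information about height, so their separate parameter estimates would be unstable.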

Model Building: What do I include in my model?

A three-predictor experiment can have a jillion terms in the model:

At the very least…
1) $x_1$
2) $x_2$
3) $x_3$
4) $x_1 x_2$
5) $x_1 x_3$
6) $x_2 x_3$
7) $x_1 x_2 x_3$

But also could use…
8) $x_1^2$
9) $x_2^2$
10) $x_3^2$
11) $x_1^3$
12) $x_2^3$
13) $x_3^3$
14) $x_1^2 x_2 x_3^4$

and that is not exhaustive…
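
To get a feel for how quickly the list grows, the sketch below enumerates every product of the three predictors up to a chosen total degree. The function name `candidate_terms` and the degree cutoff of 3 are assumptions made for illustration; the list above goes beyond degree 3.

```python
from itertools import combinations_with_replacement


def candidate_terms(names, max_degree):
    """List every product of predictors with total degree 1..max_degree."""
    terms = []
    for degree in range(1, max_degree + 1):
        for combo in combinations_with_replacement(names, degree):
            terms.append("*".join(combo))
    return terms


terms = candidate_terms(["x1", "x2", "x3"], max_degree=3)
print(len(terms))
print(terms)
```

Raising `max_degree` makes the count climb quickly, which is the point of the slide: the set of possible terms is far larger than anyone would want to fit.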

Why not just put everything in?

• Increasing number of factors will necessarily increase fit of model (at least in terms of $R^2$)
• Type I Error rate
• Parsimony
Statistical Methods for Deciding What Stays and What Goes

1) Start with every factor
Complete Model

2) Decide which terms you think should be dropped
Reduced Model

3) Question: Is the amount of Error variance in the Reduced Model significantly greater than the amount of error variance in the Complete Model?

If the Error variance is significantly greater, we conclude that the Complete Model is better than the Reduced Model
• the removed factors increase predictive power
• the removed factors should stay in the model

If not, we conclude that the Reduced Model is just as effective as the Complete Model.
• the removed factors do not increase power
• the removed factors should be removed.
Model Building Tests

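Ho: $\beta$s removed from the model all = 0.
The removed factors don’t help us predict y.

Ha: At least one of the removed $\beta$s $\neq$ 0.

Test Statistic:

$F = \dfrac{(SSE_R - SSE_C) / (\text{number of } \beta\text{s removed})}{SSE_C / df_{SSE_C}}$

where $SSE_R$ and $SSE_C$ are the error sums of squares for the Reduced and Complete Models.

Critical Value: df num = # of $\beta$s removed
df denom = df for $SSE_C$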
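
A minimal sketch of the Complete versus Reduced comparison, using hypothetical sums of squares (the function name `nested_f` and all numbers are made up). The denominator is the mean squared error of the Complete Model, and its df is taken to be N – (k+1) with k the number of terms in the Complete Model.

```python
def nested_f(sse_reduced, sse_complete, n_removed, n, k_complete):
    """F for Ho: all removed betas = 0.

    Numerator:   (SSE_R - SSE_C) / number of betas removed
    Denominator: SSE_C / [N - (k_complete + 1)]
    """
    df_num = n_removed
    df_den = n - (k_complete + 1)
    f = ((sse_reduced - sse_complete) / df_num) / (sse_complete / df_den)
    return f, df_num, df_den


# Hypothetical: Complete Model has 4 terms, 2 of them are dropped, N = 25.
f, df_num, df_den = nested_f(sse_reduced=60.0, sse_complete=40.0,
                             n_removed=2, n=25, k_complete=4)
print(f, df_num, df_den)
```

If the observed F exceeds the critical value at these df, the removed terms were doing real work and should stay in the model.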
Comparing Full / Reduced Models:
Predicting Preference for Corn
Full Model
[SPSS output table: Sum of Squares]

Comparing Full / Reduced Models:
Predicting Sales Success
Full Model
[SPSS output table: Sum of Squares]

Regression Procedures using SPSS

Forward - Starts with a blank slate and tries each factor one at a time, retaining the factor with the largest F. It then tries the remaining factors, again one at a time, and continues adding the factor with the highest significant F until all significant factors are used up.

Backward - Starts with everything in the model, and removes factors with non-significant F’s one-by-one.

Stepwise - Similar to Forward. Main difference is each time a factor is added, SPSS goes back and checks whether other factors should still be retained.

Maxr - Find the model with the maximum $R^2$ value for a given number of factors. Researcher decides which model is best.
Limitations of model-fitting procedures

1) Often do not include higher-order factors (i.e., interaction and squared terms).

2) Perform LARGE numbers of comparisons, so the Type I Error rate goes up and up and up.

3) Should be used only as a screening procedure.

Answer to Opening Question:
In research, there is no substitute for strong theories. They allow you to winnow down a vast array of potential factors into those that you consider important. What should you include in your model? Only those factors that are needed to test your theory!
Eyeball/$R^2$ Method

1) Put all your variables in.

2) Eliminate 1 or 2 that contribute the least.

3) Re-run model.

4) Repeat steps 2 and 3 until all factors in your model appear to contribute.

5) While completing steps 1–4, be aware of the effect that removing a given factor has on $R^2$. Your ultimate goal is to choose a model that maximizes $R^2$ using the smallest number of predictors.
An Example of the Eyeball/$R^2$ Method

What factors contribute to success in a college basketball game?

Here are a number of possibilities:
a) Shooting percentage
b) Free-throw percentage
c) # of fans
d) Game Experience
e) Turnovers
f) # of Ks in coach's name
g) # of Zs in coach's name
h) # of hot dogs sold at concession stand

|Model #1 $R^2$ = .5247 |
|Factor |p-value |Decision |
|Shooting |.0023 | |
|Free Throws |.0032 | |
|Fans |.0853 | |
|Experience |.0672 | |
|Turnovers |.0021 | |
|Ks |.0435 | |
|Zs |.0001 | |
|Hot Dogs |.4235 | |

Reducing the model further

|Model #2 $R^2$ = .4973 |
|Factor |p-value |Decision |
|Shooting |.0137 | |
|Free Throws |.0432 | |
|Turnovers |.0008 | |
|Ks |.0623 | |
|Zs |.0001 | |


|Model #3 $R^2$ = .3968 |
|Factor |p-value |Decision |
|Shooting |.0137 | |
|Turnovers |.0008 | |
|Zs |.0001 | |


|Model #4 $R^2$ = .4520 |
|Factor |p-value |Decision |
|Shooting |.0137 | |
|Free Throws |.0432 | |
|Turnovers |.0008 | |
|Zs |.0001 | |
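
The trade-off in the four tables can be tallied with a few lines of Python. The R² values and predictor counts are taken from the tables above; the variable names and the idea of expressing the loss relative to Model #1 are illustrative choices, not part of the original method.

```python
# (number of predictors, R squared) for each model, from the tables above.
models = {
    "Model #1": (8, 0.5247),
    "Model #2": (5, 0.4973),
    "Model #3": (3, 0.3968),
    "Model #4": (4, 0.4520),
}

full_k, full_r2 = models["Model #1"]
losses = {}
for name, (k, r2) in models.items():
    dropped = full_k - k            # predictors removed relative to Model #1
    losses[name] = full_r2 - r2     # R squared given up by removing them
    print(f"{name}: {k} predictors, dropped {dropped}, "
          f"R2 lost = {losses[name]:.4f}")
```

Read this way, dropping three predictors (Model #2) costs under .03 of R², while dropping five (Model #3) costs almost .13, which is the kind of comparison step 5 asks you to keep an eye on.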

What do you need to report?

Full Model
• Results of the Omnibus
• $R^2$
• Which factors are significant

Reduced Models
• Which factors you decided to toss
• Results of the Omnibus
• $R^2$
• Which factors are significant

Final Model
• Results of the Omnibus
• $R^2$
• Which factors are significant
• Regression Equation

Regression Overview

Two Experiments:

1) Blood Pressure in Males vs. Females

2) Blood Pressure as a function of Exercise

Which one is ANOVA? Which one is Regression?


Main Difference between ANOVA and Regression:
• the nature of your independent variable
o Categorical IV → ANOVA
o Continuous IV → Regression

Why do we bother with Regression?


1) Reduces error of prediction
• Best prediction w/o regression: Mean

2) Allows us to get a sense of where someone falls on an unknown dimension, based on a known dimension.


3) Better sense of the population

What is the regression line?

What is the best estimate of (?

What is the best estimate for an unknown observation?


Think of regression as one data set split up into a whole bunch of smaller samples. Each sample corresponds to one value of X.

X is continuous, so the number of smaller samples we can create is effectively infinite…

If we find the mean of each of the mini-samples, we will get a bunch of points. That set of points constitutes our best predictor of Y.
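
The mini-sample idea can be sketched directly: group the observations by their value of X and take the mean of Y in each group. The data and the function name `conditional_means` are invented for illustration, and X is limited to a few repeated values so that each mini-sample has more than one observation.

```python
from collections import defaultdict


def conditional_means(x, y):
    """Mean of Y within each distinct value of X (one 'mini-sample' per X)."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    return {xi: sum(vals) / len(vals) for xi, vals in sorted(groups.items())}


# Hypothetical data: three observations at each of three X values.
x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y = [2, 3, 4, 4, 5, 6, 5, 7, 9]
means = conditional_means(x, y)
print(means)
```

In this toy data the three means happen to fall on a straight line, so a simple regression line would pass through every one of them; with real data they generally will not.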

More on the Regression line

With simple regression, we are limited to a straight line, so we can’t always predict Y as well as we would like.

In other words, because we are limited to a straight line predictor, the points that fall on the regression line won’t always be the mean of the sample for a given value of X.

Simple regression line represents the mean of the hypothesized distribution of Y at a given value of X.

ANOVA & Regression: A Comparison

|Differences |
| |ANOVA |Regression |
|Experiment |Controlled |Observational |
|Types of Independent Variables |Categorical (only) |Continuous is better |
|Best Use |Discrete # of factors/levels |Infinite range of factors/levels |
|Independent Variable |Controlled |Predictor (x) |
|Dependent Variable |Measured |Criterion (y) |
|Form of Question |Do various levels of a factor result in different levels of performance? |Can we predict performance on a criterion variable from value of predictor? |
|Experimental Method |Put all males in one box, all females in another box |Put all folks with same value of x in the same box |

|Similarities |
| |ANOVA |Regression |
|Important Relationship |Independent and Dependent Variables |
|Create a Model |hypothetical relationship between dependent (criterion) and independent (predictor) variables |
|Theory that guides Analysis |Partition variance into two components: 1) Model 2) Everything else (error) |
|SST |[pic] |[pic] |
|SSM |[pic] |[pic] |
|SSE |[pic] |[pic] |
|Technique |Utilize F and t-tests: Numerator: Var. due to model; Denominator: Var. due to error |
|Conclusion |Does model explain significant portion of variance? |
