Multiple Regression
______________________________________
1) Describe different types of multiple regression models and the methods used to analyze and interpret them.
• Non-linear terms
• Multiple-factor designs
2) Discuss several pitfalls when using regression models.
3) Outline several methods for choosing the 'best-fit model'.
Flavors of Multiple Regression Equations
_______________________________________
Non-linear terms
• study time and exam grade
• ice cream and exam grade
• caffeine and exam grade
Multiple Factors
• success in graduate school
a. GRE (MCAT; LSAT)
b. GPA
c. Undergraduate institution
d. Personal Interview
e. Letters of Recommendation
f. Personal Statement
Non-linear relationships
________________________________________
[Figure: examples of non-linear relationships omitted]
Expanding the Simple Regression Equation
________________________________________
From this…
y = β0 + β1x (+ ε)
To this…
y = β0 + β1x1 + β2x2 + β3x3 + … (+ ε)
________________________________________
y: dependent variable
x1, x2, x3: independent variables
β0: y-intercept
β1, β2, β3: parameters relating the x's to y
Least Squares Method
_______________________________________
Similar to single regression in that:
• Minimize the squared distance between individual points and the regression line
But different in that:
• The regression line is no longer required to be a straight line.
• We fit multiple terms to the data at once
• Estimate many more parameters
o One for each relevant factor, plus β0
Bad News:
Computationally demanding
Good News:
SPSS is good at computationally demanding
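If you want to see what SPSS is doing under the hood, here is a minimal sketch of the least-squares fit in Python (not part of the original SPSS-based notes; all numbers are invented for illustration):

```python
import numpy as np

# Hypothetical data: two predictors and a dependent variable (made-up numbers)
x1 = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
x2 = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])
y  = np.array([3.1, 6.8, 7.2, 11.9, 12.1, 16.0])

# Design matrix: a column of 1s for beta0, then one column per predictor
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares: the betas that minimize the sum of squared residuals
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b0, b1, b2 =", betas)
```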
Review of Assumptions/Information regarding ε
__________________________________________
1) Normally distributed with a mean of 0 and a variance equal to σ².
2) Random errors are independent.
__________________________________________
As with simple regression, we want σ² to be as small as possible. The smaller it is, the better our prediction will be; i.e., the tighter our data will cluster around our regression line.
• Helps us evaluate utility of model.
• Provides a measure of reliability for our predictions.
___________________________________________
σ² = s² = MSE = SSE / (N − # of parameters in the model)
s = √s² = √MSE = root MSE
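Continuing the hypothetical Python sketch above (it assumes the X, y, and betas defined there), these quantities are a few lines of arithmetic:

```python
# Residuals from the fitted model (X, y, betas as in the earlier sketch)
y_hat = X @ betas
sse = np.sum((y - y_hat) ** 2)

n_params = X.shape[1]            # number of parameters: beta0 plus one per predictor
mse = sse / (len(y) - n_params)  # s^2, the estimate of sigma^2
root_mse = np.sqrt(mse)          # s, expressed in the units of y
print(f"SSE={sse:.3f}  MSE={mse:.3f}  root MSE={root_mse:.3f}")
```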
Analyzing a Multiple Regression Model
(should look familiar to you)
________________________________________
Step 1: Hypothesize the deterministic part of the model.
Step 2: Use sample data to estimate the unknown parameters (β0, β1, β2, …).
Step 3: Specify the probability distribution for ε and estimate its standard deviation.
Step 4: Evaluate the model statistically.
Step 5: Find the best-fit model.
Step 6: Use the model for estimation, prediction, etc.
Testing Global Usefulness of our Model
Omnibus Test of our Model
________________________________________
Ho: β1 = β2 = β3 = … = βk = 0
Ha: At least one β ≠ 0.
Test Statistic:
F = [(SSy − SSE) / k] / {SSE / [N − (k + 1)]}
  = [R² / k] / [(1 − R²) / (N − (k + 1))]
  = MSmodel / MSerror
N = sample size
k = # of terms in model
Rejection Region:
F obs > F critical
Numerator df = k
Denominator df = N − (k + 1)
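As an illustration, with made-up sums of squares (not values from the notes), the omnibus F and its p-value could be computed as:

```python
from scipy import stats

# Hypothetical summary numbers: total and error sums of squares
ss_y, sse = 500.0, 180.0   # SSy and SSE
n, k = 30, 3               # sample size and number of terms in the model

f_obs = ((ss_y - sse) / k) / (sse / (n - (k + 1)))
p = stats.f.sf(f_obs, k, n - (k + 1))   # upper-tail p-value
print(f"F({k}, {n - (k + 1)}) = {f_obs:.2f}, p = {p:.4f}")
```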
Interpretation of our Tests
________________________________________
Omnibus Test
If we reject the null, we are 100(1 − α)% sure that our model does a better job of predicting y than chance.
Important point: Useful yes!
The best? We’ll return to this.
If your model does not pass the Omnibus, you are skating on thin ice if you try to interpret results of individual parameters.
________________________________________
Tests of Individual Parameters
If we reject the null, we are pretty sure that the independent variable contributes to the variance in the dependent variable.
• + direct relationship
• – inverse relationship
How good is our model?
__________________________________________
R²: the multiple coefficient of determination
Same idea as with simple models.
Proportion of variance explained by our model.
R² = Variability explained by the model / Total variability in the sample
   = (SSy − SSE) / SSy
   = SSmodel / SStotal
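Using the same made-up SSy and SSE as the omnibus sketch above:

```python
# R-squared from the hypothetical sums of squares used in the F-test sketch
r_squared = (ss_y - sse) / ss_y   # proportion of variance explained by the model
print(f"R^2 = {r_squared:.3f}")   # here: (500 - 180) / 500 = 0.640
```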
A simple 2nd-Order model:
One linear + one quadratic term
________________________________________
Del Monte asks you to determine the relationship between the salt concentration in a can of corn and consumers' preference ratings. Previous research suggests that there is a non-linear relationship between salt concentration and preference, such that above some value, increasing the concentration of salt does not increase subjective ratings.
[Figure: preference rating as a function of salt concentration omitted]
Interpretation of Parameter Estimates:
______________________________________________
β0: only meaningful if the sample contains data in the range of x = 0
β1: Is there a straight-line (linear) relationship between x and y?
β2: Is there a higher-order (curvilinear) relationship between x and y?
▪ + concave upward (bowl-shaped)
▪ – concave downward (mound-shaped)
________________________________________
t-test: t = (β2 − 0) / s(β2), where s(β2) is the standard error of β2
df = N − (k + 1)
Multiple Regression: Non-linear relationships (corn)
___________________________________________
[SPSS output for the corn regression (salt + salt²) omitted]
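The original handout shows SPSS output here; as a rough stand-in, a second-order fit like the corn model can be sketched with Python's statsmodels (all salt/preference numbers below are invented):

```python
import numpy as np
import statsmodels.api as sm

# Invented data: salt concentration (%) and mean preference rating
salt = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
pref = np.array([2.1, 3.8, 5.2, 6.1, 6.6, 6.8, 6.7, 6.5])

# Second-order model: preference = b0 + b1*salt + b2*salt^2 + error
X = sm.add_constant(np.column_stack([salt, salt ** 2]))
fit = sm.OLS(pref, X).fit()

# The t-test on the squared term asks: is there curvature?
print(fit.summary())   # b2 < 0 would indicate a mound-shaped (concave down) trend
```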
2-Factor model
__________________________________________
The owner of a car dealership needs to hire a new salesperson. Ideally, s/he would like to choose an employee whose personality is well-suited to selling cars and who will help the dealership sell as many cars as possible. To this end, s/he rates each of her/his salespeople along two dimensions that s/he has isolated as being particularly related to success as a salesperson: friendliness and aggressiveness. In addition to this information, s/he recorded the number of car sales made by each employee during the most recent quarter.
Do these data suggest that there is a significant relationship between the two identified personality traits and car salespersonship?
Interpretation of Parameter
Estimates for a 2-Factor Model
__________________________________________
β0: only meaningful if the sample contains data in the range of x = 0
β1: Is there a straight-line (linear) relationship between friendliness and number of sales?
▪ + direct relationship
▪ – inverse relationship
β2: Is there a straight-line (linear) relationship between aggressiveness and number of sales?
▪ + direct relationship
▪ – inverse relationship
Multiple Regression: Multiple predictors (car sales)
___________________________________________
[SPSS output for the car-sales regression (friendliness + aggressiveness) omitted]
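Again, the handout relies on SPSS output; a Python sketch of the same two-factor model, with invented ratings for eight salespeople, might look like this:

```python
import numpy as np
import statsmodels.api as sm

# Invented ratings: friendliness, aggressiveness, and quarterly sales
friendly   = np.array([7, 4, 8, 5, 9, 3, 6, 8])
aggressive = np.array([5, 7, 6, 8, 4, 9, 5, 7])
sales      = np.array([22, 18, 25, 21, 24, 17, 20, 26])

# Two-factor model: sales = b0 + b1*friendliness + b2*aggressiveness + error
X = sm.add_constant(np.column_stack([friendly, aggressive]))
fit = sm.OLS(sales, X).fit()
print(fit.params)    # b0, b1, b2
print(fit.pvalues)   # individual-parameter t-tests
```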
Regression Pitfalls
________________________________________
1) Parameter Estimability
• Need at least j + 1 distinct levels of your predictor variable, where j = the order of the polynomial in your model
2) Multi-Collinearity
• You have a problem if your predictor variables are highly correlated with one another
EX: Criterion variable: height
Predictor variables: length of thumb, length of index finger
3) Limited range
• You will have trouble finding a relationship if your predictor variables have a limited range.
EX: SAT and GPA
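To make pitfall #2 concrete, here is a hypothetical check on the thumb/index-finger example. The variance inflation factor (VIF) is a standard diagnostic for near-redundant predictors, though the notes themselves don't mention it:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Invented, deliberately correlated predictors: thumb and index-finger length (cm)
thumb = np.array([5.1, 5.8, 6.0, 6.5, 7.0, 7.2, 7.9])
index = np.array([6.9, 7.5, 7.8, 8.4, 9.0, 9.1, 9.9])

X = sm.add_constant(np.column_stack([thumb, index]))
# A common rule of thumb: VIF well above 10 signals problematic multi-collinearity
for i, name in [(1, "thumb"), (2, "index")]:
    print(name, variance_inflation_factor(X, i))
```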
Model Building: What do I include in my model?
_______________________________________
A three-predictor experiment can have a jillion possible terms in the model:
At the very least… But also could use…
1) x1        8) x1²
2) x2        9) x2²
3) x3        10) x3²
4) x1x2      11) x1³
5) x1x3      12) x2³
6) x2x3      13) x3³
7) x1x2x3    14) x1²x2x3⁴
and that is not exhaustive…
______________________________________
Why not just put everything in?
• Increasing the number of factors will necessarily increase the fit of the model (at least in terms of R²)
• Type I Error rate
• Parsimony
Statistical Methods for Deciding
What Stays and What Goes
______________________________________
1) Start with every factor
Complete Model
2) Decide which terms you think should be dropped
Reduced Model
3) Question: Is the amount of Error variance in the Reduced Model significantly greater than the amount of error variance in the Complete Model?
Decision:
If the Error variance is significantly greater, we conclude that the Complete Model is better than the Reduced Model
• the removed factors increase predictive power
• the removed factors should stay in the model
If not, we conclude that the Reduced Model is just as effective as the Complete Model.
• the removed factors do not increase power
• the removed factors should be removed.
Model Building Tests
__________________________________________
Ho: The βs removed from the model all equal 0.
They don't help us predict y.
Ha: At least one of the removed βs ≠ 0.
Test Statistic:
F = [(SSER − SSEC) / (# of βs removed)] / MSEC
Critical Value:
Numerator df = # of βs removed
Denominator df = df associated with SSEC
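With made-up sums of squares (purely illustrative), the full-versus-reduced F test looks like this:

```python
from scipy import stats

# Hypothetical sums of squares for complete vs. reduced models
sse_c, sse_r = 150.0, 190.0   # SSE of the complete and reduced models
n_removed = 2                 # number of betas dropped
df_c = 24                     # df of SSE_C: N minus parameters in the complete model

f_obs = ((sse_r - sse_c) / n_removed) / (sse_c / df_c)  # MSE_C = sse_c / df_c
p = stats.f.sf(f_obs, n_removed, df_c)
print(f"F({n_removed}, {df_c}) = {f_obs:.2f}, p = {p:.4f}")
```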
Comparing Full / Reduced Models:
Predicting Preference for Corn
______________________________________
[Sum-of-squares tables for the full and reduced models omitted]
Comparing Full / Reduced Models:
Predicting Sales Success
______________________________________
[Sum-of-squares tables for the full and reduced models omitted]
Regression Procedures using SPSS
______________________________________
Forward - Starts with a blank slate and adds factors one at a time, retaining the factor with the largest F. It then adds the remaining factors, also one at a time, and continues adding the factor with the highest significant F until all significant factors are used up.
Backward - Starts with everything in the model, and removes factors with non-significant F’s one-by-one.
Stepwise - Similar to Forward. The main difference is that each time a factor is added, SPSS goes back and checks whether the factors already in the model should still be retained.
Maxr - Finds the model with the maximum R² value for each given number of factors; the researcher decides which model is best.
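None of these procedures requires SPSS in principle. Here is a toy forward-selection loop in Python; it is a simplified sketch, not SPSS's exact algorithm, and all names and data are hypothetical:

```python
import numpy as np
import statsmodels.api as sm

def forward_select(y, candidates, alpha=0.05):
    """Toy forward selection: greedily add the candidate with the smallest
    significant p-value until none qualify. Not SPSS's exact algorithm."""
    chosen = []
    remaining = dict(candidates)   # name -> 1-D data array
    while remaining:
        # p-value of each candidate when added to the current model
        trials = {}
        for name, x in remaining.items():
            cols = [candidates[c] for c in chosen] + [x]
            X = sm.add_constant(np.column_stack(cols))
            trials[name] = sm.OLS(y, X).fit().pvalues[-1]
        best = min(trials, key=trials.get)
        if trials[best] >= alpha:   # nothing significant left to add
            break
        chosen.append(best)
        del remaining[best]
    return chosen

# Example with the invented car-sales data from earlier:
# forward_select(sales, {"friendly": friendly, "aggressive": aggressive})
```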
Limitations of model-fitting procedures
______________________________________
1) Often do not include higher-order terms (i.e., interaction and squared terms).
2) Perform LARGE numbers of comparisons, so the Type I error rate goes up and up and up.
3) Should be used only as a screening procedure.
Answer to the Opening Question:
In research, there is no substitute for a strong theory. It allows you to winnow down a vast array of potential factors into those that you consider important. What should you include in your model? Only those factors that are needed to test your theory!
Eyeball/R² Method
______________________________________
1) Put all your variables in.
2) Eliminate 1 or 2 that contribute the least.
3) Re-run model.
4) Repeat steps 2 and 3 until all factors in your model appear to contribute.
5) While completing steps 1-4, be aware of the effect that removing a given factor has on R². Your ultimate goal is to choose a model that maximizes R² using the smallest number of predictors.
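One pass of this loop might look like the following sketch (reusing the invented car-sales arrays and the statsmodels import from the earlier example):

```python
# Refit without the apparently weakest factor and watch R-squared
full = sm.OLS(sales, sm.add_constant(np.column_stack([friendly, aggressive]))).fit()
reduced = sm.OLS(sales, sm.add_constant(friendly)).fit()
print(f"full R^2 = {full.rsquared:.3f}, reduced R^2 = {reduced.rsquared:.3f}")
# A tiny drop in R^2 suggests the dropped factor was not pulling its weight.
```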
An Example of the Eyeball/R² Method
__________________________________________
What factors contribute to success in a college basketball game?
Here are a number of possibilities:
a) Shooting percentage
b) Free-throw percentage
c) # of fans
d) Game Experience
e) Turnovers
f) # of Ks in coach's name
g) # of Zs in coach's name
h) # of hot dogs sold at concession stand
|Model #1: R² = .5247 |
|Factor |p-value |Decision |
|Shooting |.0023 | |
|Free Throws |.0032 | |
|Fans |.0853 | |
|Experience |.0672 | |
|Turnovers |.0021 | |
|Ks |.0435 | |
|Zs |.0001 | |
|Hot Dogs |.4235 | |
Reducing the model further
______________________________________
|Model #2: R² = .4973 |
|Factor |p-value |Decision |
|Shooting |.0137 | |
|Free Throws |.0432 | |
|Turnovers |.0008 | |
|Ks |.0623 | |
|Zs |.0001 | |
__________________________________________
|Model #3: R² = .3968 |
|Factor |p-value |Decision |
|Shooting |.0137 | |
|Turnovers |.0008 | |
|Zs |.0001 | |
__________________________________________
|Model #4: R² = .4520 |
|Factor |p-value |Decision |
|Shooting |.0137 | |
|Free Throws |.0432 | |
|Turnovers |.0008 | |
|Zs |.0001 | |
What do you need to report?
__________________________________________
Full Model
• Results of the Omnibus
• R²
• Which factors are significant
Reduced Models
• Which factors you decided to toss
• Results of the Omnibus
• R²
• Which factors are significant
Final Model
• Results of the Omnibus
• R²
• Which factors are significant
• Regression Equation
Regression Overview
______________________________________
Two Experiments:
1) Blood Pressure in Males vs. Females
2) Blood Pressure as a function of Exercise
Which one is ANOVA? Which one is Regression?
______________________________________
Main Difference between ANOVA and Regression:
• the nature of your independent variable
o Categorical IV → ANOVA
o Continuous IV → Regression
Why do we bother with Regression?
______________________________________
Prediction
1) Reduces error of prediction
• Best prediction w/o regression: Mean
2) Allows us to get a sense of where someone falls on an unknown dimension, based on a known dimension.
Estimation
1) Better sense of the population
What is the regression line?
______________________________________
What is the best estimate of μ?
What is the best estimate for an unknown observation?
______________________________________
Think of regression as one data set split up into a whole bunch of smaller samples. Each sample corresponds to one value of X.
X is continuous, so the number of smaller samples we can create is effectively infinite…
If we find the mean of each of the mini-samples, we will get a bunch of points. That set of points constitutes our best predictor of Y.
More on the Regression line
______________________________________
With simple regression, we are limited to a straight line, so we can’t always predict Y as well as we would like.
In other words, because we are limited to a straight line predictor, the points that fall on the regression line won’t always be the mean of the sample for a given value of X.
Simple regression line represents the mean of the hypothesized distribution of Y at a given value of X.
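A quick simulation (invented data, just to illustrate the idea) shows that the means of those mini-samples track the regression line:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = 2.0 + 1.5 * x + rng.normal(0, 2, 500)   # a true straight-line relationship

# Mean of y within narrow bins of x ~ the "mini-sample" means described above
bins = np.digitize(x, np.arange(1, 10))     # bin 0 covers x in [0,1), etc.
for b in range(10):
    in_bin = bins == b
    if in_bin.any():
        print(f"x ~ {b + 0.5:4.1f}: mean y = {y[in_bin].mean():5.2f}")
# These bin means should hover near the line 2 + 1.5x, i.e., the regression line.
```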
ANOVA & Regression: A Comparison
________________________________________
|Differences | | |
| |ANOVA |Regression |
|Experiment |Controlled |Observational |
|Types of independent variables |Categorical (only) |Continuous is better |
|Best use |Discrete # of factors/levels |Infinite range of factors/levels |
|Independent variable |Controlled |Predictor (x) |
|Dependent variable |Measured |Criterion (y) |
|Form of question |Do various levels of a factor result in different levels of performance? |Can we predict performance on a criterion variable from the value of a predictor? |
|Experimental method |Put all males in one box, all females in another box |Put all folks with the same value of x in the same box |
|Similarities (both ANOVA and Regression) | |
|Important relationship |Independent and dependent variables |
|Create a model |A hypothesized relationship between the dependent (criterion) and independent (predictor) variables |
|Theory that guides analysis |Partition variance into two components: 1) model, 2) everything else (error) |
|Sums of squares |SST = Σ(y − ȳ)², SSM = Σ(ŷ − ȳ)², SSE = Σ(y − ŷ)², with SST = SSM + SSE (formula figures from the original omitted) |
|Technique |Utilize F and t-tests: numerator = variance due to model, denominator = variance due to error |
|Conclusion |Does the model explain a significant portion of the variance? |