We will be asked to interpret the computed slope and intercept.
Slope:
For each additional unit of the predictor, how much the response increases or decreases on average.
Here it means each additional square foot of the house is associated with an increase in the price of the house equal to the slope, in dollars per square foot.
Intercept:
The expected value of the response when the predictor is zero.
Here it means that a house with zero square feet is expected to cost the intercept value in dollars, which doesn't make sense in this context; it is just a mathematical artifact of the linear model.
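The slope and intercept interpretations above can be sketched in code. This is a minimal illustration, not the fit from these notes: the `sizes` and `prices` values are invented, and `fit_line` is a hypothetical helper that applies the usual least-squares formulas.

```python
# Hypothetical data, invented for illustration (not the data behind these notes).
sizes = [1000, 1500, 2000, 2500, 3000]               # square feet
prices = [210000, 265000, 345000, 400000, 480000]    # dollars

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope

intercept, slope = fit_line(sizes, prices)
print(f"slope: {slope:.2f} dollars per additional square foot")
print(f"intercept: {intercept:.2f} dollars for a (meaningless) zero-square-foot house")
```

The slope carries the units of the response per unit of the predictor, which is why it reads as dollars per square foot here.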
Suppose we want the price of a house of a given size in square feet. We can predict it by plugging that size into our fitted line: predicted price = intercept + slope × (square feet).
It's better to predict by interpolating within the range of sizes in our data rather than extrapolating outside that range, as extrapolation can lead to unreliable predictions.
Extrapolation may not work.
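More flexible models such as neural networks can capture nonlinear patterns, but they are also generally unreliable outside the range of their training data; linear regression simply extends its straight line.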
Extrapolation Example:
If we try to predict the price of a house much larger than any house in our data using our model:
This prediction may not be accurate because it is outside the range of our data, and the relationship between size and price may not be linear for larger houses.
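A small sketch of the interpolation versus extrapolation distinction follows. The coefficients, the observed sizes, and the helper names `predict` and `is_interpolation` are all assumptions made up for illustration; they are not values from these notes.

```python
# Hypothetical fitted coefficients and observed sizes (illustrative only).
INTERCEPT = 70000.0   # dollars
SLOPE = 135.0         # dollars per square foot
observed_sizes = [1000, 1500, 2000, 2500, 3000]  # square feet in the sample

def predict(intercept, slope, x):
    """Evaluate the fitted line at x."""
    return intercept + slope * x

def is_interpolation(x, xs):
    """True if x lies inside the range of the observed predictor values."""
    return min(xs) <= x <= max(xs)

for size in (2200, 6000):
    price = predict(INTERCEPT, SLOPE, size)
    kind = "interpolation" if is_interpolation(size, observed_sizes) else "extrapolation"
    print(f"{size} sq ft -> predicted {price:.0f} dollars ({kind})")
```

The line produces a number for any input; the range check is only a reminder that nothing in the data supports the prediction outside the observed sizes.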
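$r$: correlation coefficient, measures the strength and direction of the linear relationship between $x$ and $y$.
Correlation Coefficient:
$$\rho = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
But for samples,
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
Here $r$ is close to $1$, which means an extremely strong positive linear relationship between the square footage of the house and price.
$r = 1$ means a perfect positive linear relationship, $r = -1$ means a perfect negative linear relationship, and $r = 0$ means no linear relationship.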
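As a rough sketch of how the sample correlation coefficient behaves, the following computes it from deviations about the means. The function name `sample_correlation` and all data are hypothetical, chosen only to show the extreme cases described above.

```python
import math

def sample_correlation(xs, ys):
    """Sample correlation: cross-deviations over the root of the product of squared deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented examples covering the extreme cases.
r_up = sample_correlation([1, 2, 3], [2, 4, 6])      # points on a rising line
r_down = sample_correlation([1, 2, 3], [3, 2, 1])    # points on a falling line

# Invented house-style data with some scatter around a line.
sizes = [1000, 1500, 2000, 2500, 3000]
prices = [210000, 265000, 345000, 400000, 480000]
r_houses = sample_correlation(sizes, prices)

print(r_up, r_down, round(r_houses, 4))
```

Because the deviations are divided by their own spread, the result has no units and always lies between -1 and 1.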
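Interpretation of the coefficient of determination $R^2$ (the square of the correlation coefficient): it is the approximate fraction of the variability in house prices that can be explained by the linear relationship with square footage, while the remaining fraction, $1 - R^2$, is due to other factors not captured by our model.
These other factors could be: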
Location of the house
Housing market condition
Age of the house
However, we can add such variables to our model as additional predictors to improve it.
We can have a model with several predictors, for example $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$.
Then the coefficient of determination would increase, as we are explaining more variability in the response variable.
Of course, an added variable may barely increase it if that variable carries no information about the response.
For example, whether or not the owner has a dog is probably not relevant to the price of the house.
Through hypothesis testing, we can figure out which coefficients should be zero or nonzero, essentially telling us which variables are relevant to the price of the house.
Coefficient of determination is also:
$$R^2 = 1 - \frac{SSE}{SST}$$
$SSE$: sum of squared errors, $SST$: total sum of squares.
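To make the two views of the coefficient of determination concrete, here is a minimal sketch that computes it as one minus the ratio of the sum of squared errors to the total sum of squares, then compares it with the squared sample correlation. The data are invented and the variable names are assumptions for illustration.

```python
import math

# Hypothetical data (illustrative only).
xs = [1000, 1500, 2000, 2500, 3000]
ys = [210000, 265000, 345000, 400000, 480000]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)

# Least-squares line.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# SSE: squared distances from the observations to the fitted line.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
# SST: squared distances from the observations to their mean.
sst = sum((y - y_bar) ** 2 for y in ys)

r_squared = 1 - sse / sst
r = sxy / math.sqrt(sxx * sst)   # sample correlation (sst is also the y sum of squares)

print(f"R^2 from SSE/SST: {r_squared:.6f}")
print(f"r squared:        {r * r:.6f}")
```

For simple linear regression with an intercept the two numbers agree up to rounding, which is why the notes can move between the two definitions.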
Example 2:
A tool and die maker has a small shop.
The owner wants to understand electricity costs.
$x$: number of tools made in a day
$y$: daily electricity cost in dollars
So the coefficient of determination gives the approximate fraction of the variability in daily electricity costs that can be explained by the linear relationship with the number of tools made, while the remainder is due to other factors not captured by our model.
Example 3:
$x$: year
$y$: farm population, in millions
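Suppose that we use the fitted line to predict the farm population for a year far outside the range of the data.
The resulting prediction doesn't make sense.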
Linear models are good for interpolation, but not extrapolation.
This is time series data.
We use simple models. If we use a non-linear model to fit the data, we might get a perfect fit, but it might not be a good model for prediction, as it might be overfitting the data.
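One way to see the overfitting risk is to compare a straight line with a polynomial that passes exactly through every point. The sketch below uses Lagrange interpolation as a stand-in for "a non-linear model with a perfect fit"; that choice, the data, and the function names are assumptions for illustration, not something taken from these notes.

```python
# Hypothetical, roughly linear data (illustrative only).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def lagrange_predict(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through all points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def line_predict(xs, ys, x):
    """Evaluate the least-squares line at x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    slope = sxy / sxx
    return (y_bar - slope * x_bar) + slope * x

# The polynomial reproduces every training point (zero error on the sample)...
max_train_error = max(abs(lagrange_predict(xs, ys, x) - y) for x, y in zip(xs, ys))
# ...but the two models disagree sharply at a new x beyond the data.
poly_at_8 = lagrange_predict(xs, ys, 8)
line_at_8 = line_predict(xs, ys, 8)

print(f"max training error of polynomial: {max_train_error:.2e}")
print(f"polynomial at x=8: {poly_at_8:.2f}, line at x=8: {line_at_8:.2f}")
```

The polynomial has chased the small wiggles in the sample, so its perfect fit says little about how it behaves on new inputs.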
Stock market example:
We need to use homogeneous data to fit a model.
If we don't, and instead use something like 40 years of data, we might still get a fit, but the underlying conditions keep changing over such a long period.
Always remove outliers too.
The fit minimizes the sum of squared errors. An outlier has a large squared error, which can dominate that sum and pull the line toward it, leading to a poor fit for the rest of the data.
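The effect of a single outlier on a least-squares fit can be sketched as follows. The data are invented: five points lying exactly on a line with slope 2, plus one hypothetical outlier. The helper `fit_slope` is an assumption for illustration.

```python
def fit_slope(xs, ys):
    """Slope of the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical clean data: exactly y = 2x.
clean_x = [1, 2, 3, 4, 5]
clean_y = [2, 4, 6, 8, 10]

# Same data with one invented outlier appended.
outlier_x = clean_x + [6]
outlier_y = clean_y + [40]

slope_clean = fit_slope(clean_x, clean_y)
slope_with_outlier = fit_slope(outlier_x, outlier_y)

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_with_outlier:.2f}")
```

One point out of six is enough to change the slope substantially, because its deviation enters the criterion squared.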