STA258 Lecture 21

Linear Regression Examples:
Example 1:
- $Y_{i} = α + β X_{i}$
- $Y_{i}$ is the response
- $X_{i}$ is the predictor, the independent variable.
- $X_{i}$ is the size of a house in square feet
- $Y_{i}$ is the price of the house in thousands of dollars.
- $\begin{matrix} House & 1 & 2 & 3 & 4 & 5 \\ X_{i} & 1000 & 1200 & 1500 & 1800 & 2000 \\ Y_{i} & 200 & 220 & 260 & 310 & 330 \end{matrix}$
- As the size of the house increases, the price also increases.
- The goal of fitting a line to this data is prediction.
- $\hat{β} = \frac{\sum_{i = 1}^{n} (X_{i} - \bar{X}) (Y_{i} - \bar{Y})}{\sum_{i = 1}^{n} (X_{i} - \bar{X})^{2}}$
- In shorthand, $\hat{β} = \frac{S_{X Y}}{S_{X X}}$
- $\hat{α} = \bar{Y} - \hat{β} \bar{X}$
- $\bar{X} = \frac{1000 + 1200 + 1500 + 1800 + 2000}{5} = 1500$
- $\bar{Y} = \frac{200 + 220 + 260 + 310 + 330}{5} = 264$
- $\begin{matrix} X_{i} & Y_{i} & X_{i} - \bar{X} & Y_{i} - \bar{Y} & (X_{i} - \bar{X}) (Y_{i} - \bar{Y}) \\ 1000 & 200 & - 500 & - 64 & 32000 \\ 1200 & 220 & - 300 & - 44 & 13200 \\ 1500 & 260 & 0 & - 4 & 0 \\ 1800 & 310 & 300 & 46 & 13800 \\ 2000 & 330 & 500 & 66 & 33000 \end{matrix}$
- $S_{X Y} = 92000$
- $S_{X X} = 680000$
- $\hat{β} = \frac{92000}{680000} = \frac{23}{170} \approx 0.1353$
- $\hat{α} = 264 - \frac{23}{170} (1500) \approx 61.06$
- $\hat{Y} = \hat{α} + \hat{β} X \approx 0.1353 X + 61.06$
- #tk
  - We will be asked to provide an interpretation of the computed slope and intercept
  - Slope:
    - How much the predicted response increases or decreases for each additional unit of $X$ .
    - Here it means each additional square foot of the house is associated with an increase of approximately $135.30$ dollars in the price of the house, since the slope is $0.1353$ thousand dollars per square foot.
  - Intercept:
    - The expected value of the response when the predictor is zero.
    - Here it means that a house with zero square feet is expected to cost approximately $61060$ dollars, which doesn't make sense in this context. It is just a mathematical artifact of the linear model, since $X = 0$ lies far outside the observed data.
- Suppose we want a house with $1600$ square feet. We can predict the price using our fitted line:
  - $\hat{Y} = 0.1353 (1600) + 61.06 \approx 277.54$ , i.e. about 277,540 dollars.
- It's better to predict by interpolating within the range of our data (between $1000$ and $2000$ square feet) rather than extrapolating outside that range, as extrapolation can lead to unreliable predictions.
- Extrapolation may not work.
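  - More flexible models such as neural networks are sometimes used when the relationship is not linear, but no model is guaranteed to extrapolate reliably, and linear regression certainly is not.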
- Extrapolation Example:
  - If we try to predict the price of a house with $2500$ square feet using our model:
  - $\hat{Y} = 0.1353 (2500) + 61.06 \approx 399.31$ , i.e. about 399,310 dollars.
  - This prediction may not be accurate because it is outside the range of our data, and the relationship between size and price may not be linear for larger houses.
- $ρ$ : the population correlation coefficient, which measures the strength and direction of the linear relationship between $X$ and $Y$ .
- Correlation Coefficient:
  - For a sample, $r = \frac{\sum_{i = 1}^{n} (X_{i} - \bar{X}) (Y_{i} - \bar{Y})}{\sqrt{\sum_{i = 1}^{n} (X_{i} - \bar{X})^{2} \sum_{i = 1}^{n} (Y_{i} - \bar{Y})^{2}}} = \frac{S_{X Y}}{\sqrt{S_{X X} S_{Y Y}}}$
  - $S_{Y Y} = \sum_{i = 1}^{n} (Y_{i} - \bar{Y})^{2} = (- 64)^{2} + (- 44)^{2} + (- 4)^{2} + 46^{2} + 66^{2} = 12520$
  - $r = \frac{92000}{\sqrt{(680000) (12520)}} \approx 0.9971$
    - This means an extremely strong positive linear relationship between the square footage of the house and its price.
  - $- 1 \leq r \leq 1$
  - $r = 1$ means a perfect positive linear relationship, $r = - 1$ means a perfect negative linear relationship, and $r = 0$ means no linear relationship.
- Coefficient of determination:
  - $r^{2}$
    - $0 \leq r^{2} \leq 1$
    - Squaring removes the sign, so $r^{2}$ is never negative.
  - ${0.997}^{2} \approx 0.994$
  - Interpretation: approximately $99.4\%$ of the variability in house prices can be explained by the linear relationship with square footage, while the remaining $0.6\%$ is due to other factors not captured by our model.
    - The remaining $0.6\%$ could be:
      - Location of the house
      - Housing market condition
      - Age of the house
    - However, we can add such variables to the model to improve it.
      - We can have $Y = α + β_{1} (\text{Size}) + β_{2} (\text{Location})$
      - Then the coefficient of determination would typically increase, as we are explaining more of the variability in the response variable.
      - An added variable may leave it essentially unchanged if the variable carries no relevant information. For example, whether the owner has a dog is probably not relevant to the price of the house.
    - Through hypothesis testing, we can figure out which coefficients should be $0$ and which should be nonzero, essentially telling us which variables are relevant to the price of the house.
  - Coefficient of determination is also:
    - $r^{2} = 1 - \frac{S S E}{S S T} = 1 - \frac{\sum_{i = 1}^{n} (Y_{i} - \hat{Y}_{i})^{2}}{\sum_{i = 1}^{n} (Y_{i} - \bar{Y})^{2}}$
    - $S S E$ : sum of squared errors, $S S T$ : total sum of squares.
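
The hand calculation for Example 1 can be cross-checked with a short script. This is only a sketch in plain Python: the variable names (`sizes`, `prices`, `predict`) are chosen here for illustration and are not from the lecture. It recomputes $S_{X Y}$, $S_{X X}$, $S_{Y Y}$, the fitted line, $r$, and the two predictions discussed above.

```python
# House data from Example 1: size in square feet, price in thousands of dollars.
sizes = [1000, 1200, 1500, 1800, 2000]
prices = [200, 220, 260, 310, 330]

n = len(sizes)
x_bar = sum(sizes) / n
y_bar = sum(prices) / n

# Sums of squares and cross-products about the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, prices))
s_xx = sum((x - x_bar) ** 2 for x in sizes)
s_yy = sum((y - y_bar) ** 2 for y in prices)

beta_hat = s_xy / s_xx                 # slope
alpha_hat = y_bar - beta_hat * x_bar   # intercept
r = s_xy / (s_xx * s_yy) ** 0.5        # sample correlation coefficient


def predict(size):
    """Predicted price (thousands of dollars) from the fitted line."""
    return alpha_hat + beta_hat * size


print(f"S_XY = {s_xy:.0f}, S_XX = {s_xx:.0f}, S_YY = {s_yy:.0f}")
print(f"slope = {beta_hat:.4f}, intercept = {alpha_hat:.2f}")
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
print(f"interpolated price at 1600 sq ft: {predict(1600):.2f}")
print(f"extrapolated price at 2500 sq ft: {predict(2500):.2f}")
```

Because the script keeps the unrounded slope and intercept, its predictions can differ from the hand calculation in the last digit or two.
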
Example 2:
- A tool and die maker has a small shop.
- The owner wants to understand the shop's electricity costs.
- $X_{i}$ : number of tools made in a day
- $Y_{i}$ : daily electricity cost in dollars
- $\begin{matrix} Day & 1 & 2 & 3 & 4 & 5 \\ X_{i} & 7 & 3 & 2 & 5 & 8 \\ Y_{i} & 23.80 & 11.89 & 15.89 & 26.11 & 31.79 \end{matrix}$
- $\bar{X} = \frac{7 + 3 + 2 + 5 + 8}{5} = 5$
- $\bar{Y} = \frac{23.80 + 11.89 + 15.89 + 26.11 + 31.79}{5} = 21.896$
- $\begin{matrix} Day & 1 & 2 & 3 & 4 & 5 \\ X_{i} & 7 & 3 & 2 & 5 & 8 \\ Y_{i} & 23.80 & 11.89 & 15.89 & 26.11 & 31.79 \\ X_{i} - \bar{X} & 2 & - 2 & - 3 & 0 & 3 \\ Y_{i} - \bar{Y} & 1.904 & - 10.006 & - 6.006 & 4.214 & 9.894 \\ (X_{i} - \bar{X}) (Y_{i} - \bar{Y}) & 3.808 & 20.012 & 18.018 & 0 & 29.682 \end{matrix}$
- $S_{X Y} = 3.808 + 20.012 + 18.018 + 0 + 29.682 = 71.52$
- $S_{X X} = 2^{2} + (- 2)^{2} + (- 3)^{2} + 0^{2} + 3^{2} = 26$
- $\hat{β} = \frac{71.52}{26} = 2.75076923076923$
- $\hat{α} = \bar{Y} - \hat{β} \bar{X}$
- $\hat{α} = 21.896 - (2.75076923076923) (5) = 8.14215384615385$
- $\hat{Y} = \hat{α} + \hat{β} X \approx 2.75076923076923 X + 8.14215384615385$
- $\hat{Y} = 2.75 X + 8.14$
- $r = \frac{\sum_{i = 1}^{n} (X_{i} - \bar{X}) (Y_{i} - \bar{Y})}{\sqrt{\sum_{i = 1}^{n} (X_{i} - \bar{X})^{2} \sum_{i = 1}^{n} (Y_{i} - \bar{Y})^{2}}}$
- $S_{X X} = 26$
- $S_{Y Y} = {1.904}^{2} + (- 10.006)^{2} + (- 6.006)^{2} + {4.214}^{2} + {9.894}^{2} = 255.46632$
- $r = \frac{S_{X Y}}{\sqrt{S_{X X} S_{Y Y}}}$
- $r = \frac{71.52}{\sqrt{26 \cdot 255.46632}} = 0.877554314543112 \approx 0.8776$
- $r^{2} = {0.877554314543112}^{2} = 0.770101574973231 \approx 0.7701$
- So approximately $77.01\%$ of the variability in daily electricity costs can be explained by the linear relationship with the number of tools made, while the remaining $22.99\%$ is due to other factors not captured by our model.
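
To connect Example 2 with the $1 - \frac{S S E}{S S T}$ form of the coefficient of determination, the sketch below computes the residual for each day and then $S S E$ and $S S T$ directly. The variable names are illustrative assumptions, not part of the lecture. If the algebra is right, the result should agree with the $r^{2}$ obtained from the correlation coefficient.

```python
# Example 2 data: tools made per day and daily electricity cost in dollars.
tools = [7, 3, 2, 5, 8]
costs = [23.80, 11.89, 15.89, 26.11, 31.79]

n = len(tools)
x_bar = sum(tools) / n
y_bar = sum(costs) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(tools, costs))
s_xx = sum((x - x_bar) ** 2 for x in tools)

beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# Fitted values and residuals (observed minus fitted) for each day.
fitted = [alpha_hat + beta_hat * x for x in tools]
residuals = [y - y_hat for y, y_hat in zip(costs, fitted)]

sse = sum(e ** 2 for e in residuals)         # variation left unexplained by the line
sst = sum((y - y_bar) ** 2 for y in costs)   # total variation, equal to S_YY
r_squared = 1 - sse / sst

for day, (x, y, y_hat, e) in enumerate(zip(tools, costs, fitted, residuals), start=1):
    print(f"day {day}: X = {x}, Y = {y:.2f}, fitted = {y_hat:.2f}, residual = {e:+.2f}")
print(f"SSE = {sse:.4f}, SST = {sst:.4f}, r^2 = {r_squared:.4f}")
```

The residuals of a least-squares line with an intercept sum to zero, which is a quick sanity check on the fit.
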
Example 3:
- $X_{i}$ : Year
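- $Y_{i}$ : farm population, in millions
- $\begin{matrix} Year & Population (millions) & X_{i} - \bar{X} & Y_{i} - \bar{Y} & (X_{i} - \bar{X}) (Y_{i} - \bar{Y}) \\ 1935 & 32.11 & - 22.5 & 13.819 & - 310.9275 \\ 1940 & 30.5 & - 17.5 & 12.209 & - 213.6575 \\ 1945 & 24.4 & - 12.5 & 6.109 & - 76.3625 \\ 1950 & 23.0 & - 7.5 & 4.709 & - 35.3175 \\ 1955 & 19.1 & - 2.5 & 0.809 & - 2.0225 \\ 1960 & 15.6 & 2.5 & - 2.691 & - 6.7275 \\ 1965 & 12.4 & 7.5 & - 5.891 & - 44.1825 \\ 1970 & 9.7 & 12.5 & - 8.591 & - 107.3875 \\ 1975 & 8.9 & 17.5 & - 9.391 & - 164.3425 \\ 1980 & 7.2 & 22.5 & - 11.091 & - 249.5475 \end{matrix}$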
- $\bar{X} = \frac{1935 + 1940 + 1945 + 1950 + 1955 + 1960 + 1965 + 1970 + 1975 + 1980}{10} = \frac{3915}{2} = 1957.5$
- $\bar{Y} = \frac{32.11 + 30.5 + 24.4 + 23.0 + 19.1 + 15.6 + 12.4 + 9.7 + 8.9 + 7.2}{10} = 18.291$
- $S_{X X} = 2062.5$
- $S_{X Y} = - 1210.475$
- $S_{Y Y} = 727.125$
- $\hat{β} = \frac{S_{X Y}}{S_{X X}} = \frac{- 1210.475}{2062.5} = - 0.58689696969697 \approx - 0.5869$
- $\hat{α} = \bar{Y} - \hat{β} \bar{X} = 18.291 - (- 0.5869) (1957.5) = 1167.14775$
- $\hat{Y} = \hat{α} + \hat{β} X \approx - 0.5869 X + 1167.14775$
- Suppose that $X = 1990 ⟹ \hat{Y} = 1167.14775 - 0.5869 (1990) \approx - 0.783$
- This doesn't make sense, since a population cannot be negative.
- Linear models are good for interpolation, but not extrapolation.
- This is time series data.
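
The failure of extrapolation in Example 3 can be made concrete by asking where the fitted line reaches zero. The sketch below refits the line in plain Python and solves $\hat{α} + \hat{β} X = 0$ for $X$. The name `zero_year` and the extra prediction years are illustrative choices, not from the lecture.

```python
# Example 3 data: year and farm population in millions.
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965, 1970, 1975, 1980]
population = [32.11, 30.5, 24.4, 23.0, 19.1, 15.6, 12.4, 9.7, 8.9, 7.2]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(population) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, population))
s_xx = sum((x - x_bar) ** 2 for x in years)

beta_hat = s_xy / s_xx                 # negative: the population is shrinking
alpha_hat = y_bar - beta_hat * x_bar

# Year at which the fitted line reaches zero: solve alpha + beta * x = 0.
zero_year = -alpha_hat / beta_hat

print(f"slope = {beta_hat:.4f}, intercept = {alpha_hat:.2f}")
print(f"fitted line crosses zero at year {zero_year:.1f}")
for year in (1960, 1985, 1990, 2000):
    print(f"{year}: predicted population = {alpha_hat + beta_hat * year:.2f} million")
```

Any year past the crossing point gets a negative predicted population, so the line is only usable inside, or very close to, the observed range of years.
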
We prefer simple models. If we fit a more flexible non-linear model to the data, we might get a perfect fit, but it might not be a good model for prediction, as it might be overfitting the data.
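
As a rough illustration of overfitting, the sketch below passes a degree-4 polynomial exactly through the five points of Example 2 and compares it with the rounded fitted line $\hat{Y} = 2.75 X + 8.14$. Lagrange interpolation is used here only as a convenient way to get a perfect fit, not as a method from the lecture. The helper names and the test point $X = 10$ are assumptions made for the example.

```python
# Example 2 data: tools made per day and daily electricity cost in dollars.
tools = [7, 3, 2, 5, 8]
costs = [23.80, 11.89, 15.89, 26.11, 31.79]


def interpolating_polynomial(x, xs, ys):
    """Evaluate the Lagrange polynomial through all points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


def fitted_line(x):
    """Rounded least-squares line from Example 2."""
    return 2.75 * x + 8.14


# The polynomial reproduces every observed cost, so its training error is zero.
for x, y in zip(tools, costs):
    value = interpolating_polynomial(x, tools, costs)
    print(f"X = {x}: observed = {y:.2f}, polynomial = {value:.2f}")

# Outside the data, the two models disagree sharply.
new_x = 10  # hypothetical production level, not in the data
print(f"line at X = {new_x}:       {fitted_line(new_x):.2f}")
print(f"polynomial at X = {new_x}: {interpolating_polynomial(new_x, tools, costs):.2f}")
```

At $X = 10$ the line predicts a cost in the mid-thirties of dollars, while the perfectly fitting polynomial predicts a value several times larger, which is hard to believe for two extra tools.
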
Stock market example:
- We need to use homogeneous data to fit a model.
- If we don't, and instead use something like 40 years of data, we might still get a fit, but market conditions differ across that period.
Always remove outliers too.
The fit minimizes the sum of squared errors. An outlier has a large squared error, which dominates that sum and leads to a poor fit for the rest of the data.
Minimizing $| ϵ |$ instead of $ϵ^{2}$ is more robust to outliers.
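
The difference between the two criteria can be seen on a small made-up case. The sketch below takes the house sizes from Example 1 but replaces the last price with a hypothetical outlier of 900, which is not lecture data. It then fits one line by least squares and one by least absolute error. The absolute-error fit uses brute force, relying on the fact that some optimal line passes through two of the data points. The function names are illustrative.

```python
from itertools import combinations

# Example 1 sizes, with the last price replaced by a hypothetical outlier (900).
sizes = [1000, 1200, 1500, 1800, 2000]
prices = [200, 220, 260, 310, 900]


def least_squares(xs, ys):
    """Slope and intercept minimizing the sum of squared errors."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


def least_absolute(xs, ys):
    """Slope and intercept minimizing the sum of absolute errors.

    Brute force: some optimal line passes through two data points,
    so it is enough to try every pair with distinct x values.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        cost = sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys))
        if best is None or cost < best[0]:
            best = (cost, slope, intercept)
    return best[1], best[2]


ols_slope, ols_intercept = least_squares(sizes, prices)
lad_slope, lad_intercept = least_absolute(sizes, prices)
print(f"squared error:  slope = {ols_slope:.4f}, intercept = {ols_intercept:.2f}")
print(f"absolute error: slope = {lad_slope:.4f}, intercept = {lad_intercept:.2f}")
```

With the outlier present, the least-squares slope is pulled to roughly four times the slope of the clean data, while the absolute-error slope stays close to it.
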