STA258 Lecture 08

Review
- Standard Error
- Confidence Intervals
Today we're going to see large sample Confidence Intervals.
$Z$ Confidence Intervals and $T$ as well.
#tk Large sample Confidence Intervals are on the test.
Large Sample Confidence Intervals
- $Z = \frac{\hat{θ} - θ}{σ_{\hat{θ}}} \sim N (0, 1)$
- We know $σ_{\hat{θ}} = \sqrt{Var (\hat{θ})}$
- If $\hat{θ} = \bar{X}$
- Then $Var (\bar{X}) = \frac{σ^{2}}{n}$ , so $σ_{\bar{X}} = \sqrt{\frac{σ^{2}}{n}} = \frac{σ}{\sqrt{n}}$
- So we have $\frac{\bar{X} - μ}{\frac{σ}{\sqrt{n}}}$
- Suppose we have $P (a \leq Z \leq b) = 1 - α$
- What should our $a, b$ be?
  - Suppose we have a normal.
  - Our area is $1 - α$
  - So our tails are $\frac{α}{2}$
  - $b = Z_{\frac{α}{2}}$
  - By symmetry about $0$ :
  - $a = - Z_{\frac{α}{2}}$
- $P (- Z_{\frac{α}{2}} \leq \frac{\hat{θ} - θ}{σ_{\hat{θ}}} \leq Z_{\frac{α}{2}}) = 1 - α$
- We need to isolate $θ$
- $P (- σ_{\hat{θ}} Z_{\frac{α}{2}} \leq \hat{θ} - θ \leq σ_{\hat{θ}} Z_{\frac{α}{2}}) = 1 - α$
- $P (\hat{θ} - Z_{\frac{α}{2}} σ_{\hat{θ}} \leq θ \leq \hat{θ} + Z_{\frac{α}{2}} σ_{\hat{θ}}) = 1 - α$
- So our confidence interval is:
- $[\hat{θ} - Z_{\frac{α}{2}} σ_{\hat{θ}}, \hat{θ} + Z_{\frac{α}{2}} σ_{\hat{θ}}] = \hat{θ} \pm Z_{\frac{α}{2}} σ_{\hat{θ}}$
- So our CI for $θ$ is the point estimator $\pm$ our cutoff times standard error.
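The recipe "point estimator $\pm$ cutoff times standard error" can be sketched in a few lines of Python. This is only an illustration: the function name `z_interval` and the numbers passed to it are made up for the sketch, and the cutoff comes from the standard library's `NormalDist` rather than a printed table.

```python
from statistics import NormalDist

def z_interval(estimate, std_error, confidence=0.95):
    """Return (lower, upper) for estimate +/- z_{alpha/2} * std_error."""
    alpha = 1.0 - confidence
    # z_{alpha/2} is the point with area alpha/2 to its right,
    # i.e. the (1 - alpha/2) quantile of N(0, 1).
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * std_error
    return estimate - margin, estimate + margin

# Hypothetical inputs, chosen only to exercise the function.
lower, upper = z_interval(estimate=10.0, std_error=2.0, confidence=0.95)
print(round(lower, 2), round(upper, 2))
```

The interval is always centered at the estimate; only the margin changes with the confidence level and the standard error.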
Example:
- The mean annual household income in a sample is $119155$
- Assume this is based on a sample of $80$ households.
- $σ = 30000$
- We have $X_{1}, \dots, X_{n} \overset{i i d}{\sim} N (μ, σ^{2})$
- $\hat{θ} = \bar{X}$
- Then $SE [\bar{X}] = \frac{σ}{\sqrt{n}}$
- Then CI for $μ$ :
  - $\bar{X} \pm Z_{\frac{α}{2}} \cdot \frac{σ}{\sqrt{n}}$
- Compute margin of error.
  - $= Z_{\frac{α}{2}} \cdot \frac{σ}{\sqrt{n}}$
  - We want a $90 %$ CI
  - $1 - α = 0.9$
  - $α = 0.1$
  - $\frac{α}{2} = 0.05$
  - Look up the upper-tail area $0.05$ on the table: $Z_{0.05} \approx 1.64$
  - We get $1.64 \cdot \frac{σ}{\sqrt{n}}$
  - $1.64 \cdot \frac{30000}{\sqrt{80}} = 5500.72722464948$
  - CI is
    - $\bar{X} \pm Z_{\frac{α}{2}} \frac{σ}{\sqrt{n}}$
    - $119155 \pm 5500.7272$
    - $119155 + 5500.7272 = 124655.7272$
    - $119155 - 5500.7272 = 113654.2728$
- If we increase the confidence level, what will happen?
  - We expect a wider interval.
  - $1 - α = 0.95$
  - $α = 0.05$
  - $\frac{0.05}{2} = 0.025$
  - Look up $Z_{0.025}$ on the table: $1.96$
  - CI
    - $\bar{X} \pm Z_{\frac{α}{2}} \frac{σ}{\sqrt{n}}$
    - $119155 \pm Z_{\frac{α}{2}} \frac{σ}{\sqrt{n}}$
    - $119155 \pm (1.96) \frac{30000}{\sqrt{80}}$
    - $(1.96) \frac{30000}{\sqrt{80}} = 6574.03985384938$
    - The margin of error grew from about $5500.73$ to about $6574.04$ , so raising the confidence level widens the interval.
    - #tk Can we optimize the tradeoff. Get the most confidence with least interval.
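To get a feel for the tradeoff between confidence and width, here is a small sketch that recomputes the margin of error for the income example at several confidence levels. The extra levels ($80 \%$ and $99 \%$) are not from the lecture, and the cutoffs are exact normal quantiles from `NormalDist`, so they differ slightly from two-decimal table values.

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 119155, 30000, 80  # values from the income example
std_error = sigma / sqrt(n)

margins = {}
for confidence in (0.80, 0.90, 0.95, 0.99):
    # z_{alpha/2} is the (1 - alpha/2) quantile of the standard normal.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margins[confidence] = z * std_error
    print(f"{confidence:.0%}: z = {z:.3f}, margin = {margins[confidence]:.2f}")
```

With $σ$ and the confidence level fixed, the only way to shrink the margin is to increase $n$, which is what the sample size calculations address.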
Analyzing pharma
- We're looking for some $n$ to achieve a CI with $0.95$ confidence
- $1 - α = 0.95$
- $α = 0.05$
- $\frac{0.05}{2} = 0.025$
- $m = 0.005$ as our margin of error
- We get $Z_{\frac{α}{2}} = Z_{0.025} = 1.96$
- $σ = 0.0068$
  - How do we know $σ$ when we don't know the mean?
  - We can estimate $σ$ from historical data.
- Margin of error
- $Z_{\frac{α}{2}} \frac{σ}{\sqrt{n}}$
- $m = Z_{\frac{α}{2}} \frac{σ}{\sqrt{n}}$
- $n = {(\frac{Z_{\frac{α}{2}} σ}{m})}^{2}$
- $n = {(\frac{(1.96) (0.0068)}{(0.005)})}^{2} = 7.10542336$
- We round up, so $n = 8$
- Is it realistic to assume that the population follows a normal?
- Usually we need $n \geq 30$ for the Central Limit Theorem.
- With a sample this small we can't rely on the CLT, so if the population is not normal, the sampling dist of $\bar{X}$ may not be normal either.
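The sample size formula above, including the round-up step, can be sketched as follows. The function name `sample_size_for_mean` is hypothetical; the cutoff `z` is passed in as read from the table.

```python
from math import ceil

def sample_size_for_mean(z, sigma, margin):
    """Smallest integer n with z * sigma / sqrt(n) <= margin."""
    # Always round up: rounding down would give a margin larger than requested.
    return ceil((z * sigma / margin) ** 2)

# Pharma example: 95% confidence, sigma = 0.0068, margin m = 0.005.
n_pharma = sample_size_for_mean(z=1.96, sigma=0.0068, margin=0.005)
print(n_pharma)
```

Halving the margin multiplies the required $n$ by four, since $n$ depends on $\frac{1}{m^{2}}$.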
Example:
- We have $m = 2$
- and $22.5$ as the value for $σ$
- Find the Sample Size Recommended to Estimate Mean.
  - $1 - α = 0.9$
  - $α = 0.1$
  - $\frac{α}{2} = 0.05$
  - $Z_{\frac{α}{2}} = 1.65$
  - $n = {(\frac{Z_{\frac{α}{2}} σ}{m})}^{2}$
  - $n = {(\frac{(1.65) (22.5)}{(2)})}^{2} = 344.56640625$
  - $n = 345$
  - $1 - α = 0.95$
  - $α = 0.05$
  - $\frac{α}{2} = 0.025$
  - $Z_{\frac{α}{2}} = 1.96$
  - $n = {(\frac{(1.96) (22.5)}{(2)})}^{2} = 486.2025$
  - $n = 487$
- #tk that's it for test 1.
Confidence Intervals based on the $t$ dist
- What if $σ$ is not known.
- Estimate it by $S$
- $T = \frac{\bar{X} - μ}{\frac{S}{\sqrt{n}}} \sim t_{(n - 1)}$
- $P (| T | < m) = 1 - α$
- Find two end points of the interval.
- Our central area is $1 - α$ , so each tail has area $\frac{α}{2}$
- $P (- t_{(\frac{α}{2})} \leq T \leq t_{(\frac{α}{2})}) = 1 - α$
- Always isolate the unknown parameter.
- $P (- t_{(\frac{α}{2})} \leq \frac{\bar{X} - μ}{\frac{S}{\sqrt{n}}} \leq t_{(\frac{α}{2})}) = 1 - α$
- $P (\bar{X} - t_{(\frac{α}{2})} \frac{S}{\sqrt{n}} \leq μ \leq \bar{X} + t_{(\frac{α}{2})} \frac{S}{\sqrt{n}}) = 1 - α$
- $C I = \bar{X} \pm t_{(\frac{α}{2})} \frac{S}{\sqrt{n}}$
- The lower and upper limit are Random Variables.
- Prior to observation, the CI is random.
- After an observation, then the CI is deterministic.
- Saying that our parameter is in the observed CI with probability $1 - α$ is not the most accurate interpretation.
- If we have 90 percent confidence, then in about 90 out of 100 repeated samples the interval we construct will contain the parameter.
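The repeated-sampling interpretation can be checked with a small simulation. This is a sketch under assumptions that are not from the lecture: the population is $N (50, 10^{2})$, $n = 25$, and for simplicity it uses the $Z$ interval with known $σ$ rather than the $t$ interval, so that only the standard library is needed.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage(mu=50.0, sigma=10.0, n=25, confidence=0.90, trials=2000, seed=0):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        # Before sampling, the interval endpoints are random variables.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = fmean(sample)
        # After sampling, the interval is fixed: it either covers mu or not.
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials

rate = coverage()
print(rate)
```

The printed fraction should land close to the nominal confidence level, which is the sense in which the procedure, not any single interval, is "90 percent" reliable.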
Normal Population Assumption
- For small samples ( $n < 15$ ), the data should follow a normal dist. If you see outliers or skewness, be cautious.
- For moderate samples ( $15 \leq n \leq 40$ ), the data should not show strong skewness or outliers. Make a histogram, boxplot or Q-Q plot to check.
- For large samples ( $n > 40$ ), the $t$ procedure is fairly robust to non-normality, unless the data are extremely skewed or contain outliers. Make a histogram, boxplot or Q-Q plot to check.
  - The reason: as $n$ increases, the $t$ dist approaches the normal dist.
Example:
- Ancient air.
- We can examine the gas trapped inside ancient amber.
- This gives us a sample of the air from the time when the amber was formed.
- We have these observations:
  - $n = 9$
  - $\begin{matrix} 63.4 & 65 & 64.4 & 63.3 & 54.8 & 54.5 & 50.8 & 49.2 & 51.0 \end{matrix}$
- Find a $90 %$ CI for the mean nitrogen level.
- Our present atmosphere is about $78 %$ nitrogen
- $\bar{X} = 59.58$
- $S = \sqrt{\frac{\sum (X_{i} - \bar{X})^{2}}{n - 1}} = 6.2552$
- $1 - α = 0.9$
- $α = 0.1$
- $\frac{α}{2} = 0.05$
- $t_{(\frac{α}{2})} = t_{0.05; 8} = 1.860$
- $P (\bar{X} - t_{(\frac{α}{2})} \frac{S}{\sqrt{n}} \leq μ \leq \bar{X} + t_{(\frac{α}{2})} \frac{S}{\sqrt{n}}) = 1 - α$
- $P (59.58 - (1.860) \frac{6.2552}{\sqrt{9}} \leq μ \leq 59.58 + (1.860) \frac{6.2552}{\sqrt{9}}) = 0.9$
- $C I = 59.58 \pm (1.860) \frac{6.2552}{\sqrt{9}}$
- $= 59.58 \pm 3.878$
- $= (55.702, 63.458)$
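The $t$ interval can be sketched from summary statistics alone. The function name `t_interval` is hypothetical, and the critical value is passed in from the $t$ table because the Python standard library has no $t$ quantile function. The inputs are the values plugged into the interval above.

```python
from math import sqrt

def t_interval(x_bar, s, n, t_crit):
    """Return (lower, upper) for x_bar +/- t_crit * s / sqrt(n)."""
    margin = t_crit * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Amber example: n = 9, so df = 8 and the table gives t_{0.05; 8} = 1.860.
lower, upper = t_interval(x_bar=59.58, s=6.2552, n=9, t_crit=1.860)
print(round(lower, 3), round(upper, 3))
```

Passing `t_crit` explicitly keeps the degrees of freedom visible: the caller has to look up the row $n - 1$, not $n$.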
Example:
- A film-processing company wants to know how many pictures were stored on computers.
- Random sample of 10 digital camera owners.
- Estimate with $95 %$ confidence the mean number of pictures stored.
- Data:
  - $\begin{matrix} 25 & 6 & 22 & 26 & 31 & 18 & 13 & 20 & 13 & 2 \end{matrix}$
- $n = 10$
- $\bar{X} = 17.7$
- $S = 9.08$
- $1 - α = 0.95$
- $α = 0.05$
- $\frac{α}{2} = 0.025$
- $t_{(\frac{α}{2})} = t_{0.025; 9} = 2.262$
- Margin of error:
- $= t_{(0.025; 9)} \frac{S}{\sqrt{n}}$
- $2.262 (\frac{9.08}{\sqrt{10}}) = 6.49498943710919$
- CI:
  - $17.7 \pm 6.495 = (11.205, 24.195)$
- Assumption of normality is critical here since $n$ is small.
- #tk learn q-q-plots Normal Q-Q Plot.
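As a starting point for Normal Q-Q plots, here is a sketch that computes the points such a plot would show for the picture counts listed above: each sorted observation is paired with a standard normal quantile. The plotting position $\frac{i - 0.5}{n}$ is one common convention and is an assumption here; statistical software may use a slightly different one.

```python
from statistics import NormalDist

data = [25, 6, 22, 26, 31, 18, 13, 20, 13, 2]  # picture counts from the example

def normal_qq_points(values):
    """Pair each sorted observation with a standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    # Plotting position (i - 0.5) / n for the i-th smallest value (assumed convention).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

for q, x in normal_qq_points(data):
    print(f"{q:6.2f}  {x}")
```

If the data are roughly normal, plotting the second column against the first gives points close to a straight line; curvature suggests skewness, and isolated points far off the line suggest outliers.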
Example:
- Suppose we have a random sample:
- $X_{1}, \dots, X_{n} \overset{i i d}{\sim} Bernoulli (p)$
- How can we estimate $p$ ?
- Flip a coin $100$ times, how can you estimate $p$ ?
- It's the number of successes over number of trials.
- $S = \sum_{i = 1}^{n} X_{i} \sim Bin (n, p)$
- $S$ is total number of successes.
- $\hat{p} = \frac{S}{n}$
- $\hat{p}$ is our proportion of successes. An estimator for $p$ .
- We need to standardize.
- $E [\hat{p}] = E [\frac{S}{n}] = \frac{1}{n} E [S] = \frac{1}{n} n p = p$
- $Var (\hat{p}) = Var (\frac{S}{n}) = \frac{1}{n^{2}} Var (S) = \frac{1}{n^{2}} n p (1 - p) = \frac{p (1 - p)}{n}$
- $SE [\hat{p}] = \sqrt{Var (\hat{p})} = \sqrt{\frac{p (1 - p)}{n}}$
- Based on Central Limit Theorem we have:
  - $\hat{p} \approx N (p, \frac{p (1 - p)}{n}) ⟹ \frac{\hat{p} - p}{\sqrt{\frac{p (1 - p)}{n}}} \approx N (0, 1)$
- CI for $p$ :
- $\hat{p} \pm Z_{\frac{α}{2}} \sqrt{\frac{p (1 - p)}{n}}$
Example:
- In a poll of $800$ adults
- $45 %$ indicated that movies are getting better.
- $43 %$ indicated that movies are getting worse.
- Find a $98 %$ CI for the proportion of all adults who think movies are getting better.
- $\hat{p} = 0.45$
- $1 - α = 0.98$
- $α = 0.02$
- $\frac{α}{2} = 0.01$
- $σ_{\hat{p}} = \sqrt{\frac{p (1 - p)}{n}}$
- $σ_{\hat{p}} = \sqrt{\frac{0.45 (1 - 0.45)}{800}} = 0.0175890590993379$
- CI for $p$ :
- $0.45 \pm Z_{\frac{α}{2}} \sqrt{\frac{p (1 - p)}{n}} \approx 0.45 \pm Z_{\frac{α}{2}} \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}$
  - We don't know $p$ , but we want a result in terms of known quantities.
  - So we use the plug-in estimate $\hat{p}$ in the standard error.
- $0.45 \pm 2.33 \cdot 0.0175890590993379$
- $0.45 \pm 0.0410$
- $= (0.409, 0.491)$
- The whole CI lies below $50 %$ .
- So even at the upper end of our $98 %$ confidence interval, less than half of adults think movies are getting better.
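The large-sample interval for a proportion, with the plug-in standard error, can be sketched as below. The function name `proportion_interval` is hypothetical, and the cutoff is the exact normal quantile from `NormalDist`, so the last digits can differ slightly from a calculation with the table value $2.33$.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(p_hat, n, confidence):
    """Return (lower, upper) for p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Plug-in standard error: p is unknown, so p_hat stands in for it.
    std_error = sqrt(p_hat * (1 - p_hat) / n)
    margin = z * std_error
    return p_hat - margin, p_hat + margin

# Movie poll: 45% of 800 adults, 98% confidence.
lower, upper = proportion_interval(p_hat=0.45, n=800, confidence=0.98)
print(round(lower, 3), round(upper, 3))
```

Checking whether `upper` is below $0.5$ is the programmatic version of the conclusion drawn from the interval.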
Example:
- Utility of mobile devices raises questions on intrusion of work into personal life.
- In a survey, $158$ of $473$ employees took work with them on vacation.
- a:
  - What is the point estimate of the population proportion of all employees who take work with them on vacation?
  - $\hat{p} = \frac{158}{473} = 0.334$
- b:
  - At $0.9$ confidence, what is the margin of error for the population proportion of all employees who take work with them on vacation?
  - $= Z_{\frac{α}{2}} \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}$
  - $1 - α = 0.9$
  - $α = 0.1$
  - $\frac{α}{2} = 0.05$
  - $Z_{\frac{α}{2}} = 1.65$
  - $\sqrt{\frac{(0.334) (1 - 0.334)}{473}} = 0.0216860161877937$
  - $1.65 \cdot 0.0216860161877937 \approx 0.036$
  - CI:
    - $0.334 \pm 0.036 = (0.298, 0.370)$
Example:
- Aisha Shariff and Yvette Ye are candidates for mayor.
- You are planning a small survey to determine the percent of voters who will vote for Shariff.
- $p$ is population proportion of voters who will vote for Shariff.
- You want to be $95 %$ confident that your estimate of $p$ is within $0.03$ of the true value.
- How large a sample should you take?
- $m = 0.03$
- $1 - α = 0.95$
- $α = 0.05$
- $\frac{α}{2} = 0.025$
- $m = Z_{\frac{α}{2}} \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}$
- What is $\hat{p}$ ?
  - We haven't taken the sample yet, so we don't know the value of $\hat{p}$
  - We can use $0.5$
  - If you sketch the plot of $p (1 - p)$
  - it's maximized at $p = 0.5$ , so this choice gives the largest (safest) $n$ .
- ${(\frac{1.96}{0.03})}^{2} (0.5) (1 - 0.5) = 1067.11111111111$
- $n = 1068$
Sample size for an interval estimate of a population proportion:
- $n = {(\frac{Z_{*}}{E})}^{2} p^{*} (1 - p^{*})$ , where $E$ is the desired margin of error
- Planning values $p^{*}$ can be chosen by:
  - Sample proportion from a previous sample of the same or similar units
  - Use a planning value of $0.5$ to maximize the required sample size when no reasonable estimate of $p$ is available.
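A short sketch of the sample size formula for a proportion, with the planning value defaulting to $0.5$. The function name `sample_size_for_proportion` and the alternative planning value $0.3$ in the second call are made up for illustration.

```python
from math import ceil

def sample_size_for_proportion(z, margin, planning_p=0.5):
    """Smallest integer n with z * sqrt(p*(1-p)/n) <= margin at p = planning_p."""
    # planning_p = 0.5 maximizes p * (1 - p), so it is the conservative default.
    return ceil((z / margin) ** 2 * planning_p * (1 - planning_p))

# Mayor example: 95% confidence, within 0.03 of the true proportion.
n_mayor = sample_size_for_proportion(z=1.96, margin=0.03)
# Hypothetical planning value from an earlier poll, for comparison only.
n_with_prior = sample_size_for_proportion(z=1.96, margin=0.03, planning_p=0.3)
print(n_mayor, n_with_prior)
```

Any planning value other than $0.5$ gives a smaller $n$, which is why $0.5$ is the safe choice when nothing is known about $p$.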
$10 %$ condition
- If less than $10 %$ of the population is sampled, the sample observations can be assumed to be independent.
- For proportions, instead of the $n \geq 30$ rule for the CLT, we check that $n \hat{p} \geq 10$ and $n (1 - \hat{p}) \geq 10$ .
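These checks are easy to automate before trusting a proportion interval. The sketch below uses a hypothetical helper `proportion_conditions_ok`; the population size is optional because it is often unknown, and the population of $5000$ in the usage line is invented for the example.

```python
def proportion_conditions_ok(n, p_hat, population_size=None):
    """Check the success/failure counts and, if possible, the 10% condition."""
    enough_successes = n * p_hat >= 10
    enough_failures = n * (1 - p_hat) >= 10
    # The 10% condition is only checked when a population size is supplied.
    small_fraction = population_size is None or n < 0.10 * population_size
    return enough_successes and enough_failures and small_fraction

print(proportion_conditions_ok(800, 0.45))                        # movie poll
print(proportion_conditions_ok(800, 0.45, population_size=5000))  # hypothetical small population
```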
Confidence Intervals for Variance
- You know $\frac{(n - 1) S^{2}}{σ^{2}} \sim χ_{(n - 1)}^{2}$ as long as we have a normal population from which we draw our sample.
- We want $P (? \leq \frac{(n - 1) S^{2}}{σ^{2}} \leq ?) = 1 - α$
- We need to find two points on the chi-square dist.
  - Top value is:
    - $χ_{(\frac{α}{2})}^{2}$
  - Bottom value is:
    - $χ_{(1 - \frac{α}{2})}^{2}$
- $P (\frac{1}{χ_{(1 - \frac{α}{2})}^{2}} \geq \frac{σ^{2}}{(n - 1) S^{2}} \geq \frac{1}{χ_{(\frac{α}{2})}^{2}}) = 1 - α$
- $P (\frac{(n - 1) S^{2}}{χ_{(1 - \frac{α}{2})}^{2}} \geq σ^{2} \geq \frac{(n - 1) S^{2}}{χ_{(\frac{α}{2})}^{2}}) = 1 - α$
- CI:
  - $[\frac{(n - 1) S^{2}}{χ_{(\frac{α}{2})}^{2}}, \frac{(n - 1) S^{2}}{χ_{(1 - \frac{α}{2})}^{2}}]$
- Example: estimate $σ^{2}$ with confidence coefficient $0.9$
- $n = 3$ means $d f = 2$
- $\bar{Y} =$
- $S^{2} = 10.57$
- $[3.53, 205.25]$
  - Crazy wide interval.
  - Likely because the sample is tiny ( $n = 3$ ) and the chi-square dist with $d f = 2$ is strongly skewed.
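The variance interval can be sketched as below. The function name `variance_interval` is hypothetical, and the two chi-square cutoffs are passed in as table values because the Python standard library has no chi-square quantile function. For $d f = 2$ and a $0.9$ confidence coefficient, standard tables give $χ_{(0.05)}^{2} = 5.991$ and $χ_{(0.95)}^{2} = 0.103$ .

```python
def variance_interval(s_squared, n, chi2_upper, chi2_lower):
    """Return (lower, upper) for sigma^2.

    chi2_upper is the cutoff with area alpha/2 to its right,
    chi2_lower the cutoff with area 1 - alpha/2 to its right.
    """
    scaled = (n - 1) * s_squared
    # Dividing by the larger cutoff gives the lower limit, and vice versa.
    return scaled / chi2_upper, scaled / chi2_lower

# Example above: n = 3, S^2 = 10.57, table values for df = 2.
lower, upper = variance_interval(10.57, 3, chi2_upper=5.991, chi2_lower=0.103)
print(round(lower, 2), round(upper, 2))
```

Unlike the $Z$ and $t$ intervals, this interval is not symmetric around $S^{2}$ , because the chi-square dist is not symmetric.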