Sahithyan's S3 — Applied Statistics

Linear Regression

Linear regression is the simplest statistical method for modeling and predicting the relationship between continuous variables.

Simple linear regression involves two variables: a dependent (response) variable and an independent (explanatory) variable.

  • Scatter Diagram
    Plots observed pairs (x_i, y_i) to visualize the relationship.
  • Correlation Coefficient
    Measures the strength and direction of the linear relationship.

General population model:

Y = \alpha + \beta X

For an individual observation:

y_i = \alpha + \beta x_i + \varepsilon_i

where

  • \alpha = intercept (value of Y when X = 0)
  • \beta = slope (rate of change of Y with respect to X)
  • \varepsilon_i = random error, assumed \varepsilon_i \sim N(0, \sigma^2)

The coefficient of determination, denoted R^2, shows how much of the variance in Y is explained by X.

The error sum of squares, denoted ESS, measures the total squared deviation of the observations from the regression line:

ESS = \sum (y_i - \alpha - \beta x_i)^2

Suppose the fitted regression line is \hat{y} = \hat{\alpha} + \hat{\beta}x. The goal is to find the \hat{\alpha} and \hat{\beta} that minimize ESS; the least squares method is used for this.

Setting the partial derivatives of ESS with respect to \alpha and \beta to zero gives the normal equations:

\sum y_i = n\alpha + \beta \sum x_i

\sum x_i y_i = \alpha \sum x_i + \beta \sum x_i^2

Solving gives:

\hat{\beta} = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2}

\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}

Alternate form using deviations:

\hat{\beta} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
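The estimators above can be computed directly from the sums; here is a minimal sketch in Python, using small made-up data for illustration:

```python
# Least-squares fit of y = alpha + beta*x (illustrative, made-up data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.3, 6.2, 8.1, 9.9]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Deviation form: S_xy = sum (x_i - x_bar)(y_i - y_bar), S_xx = sum (x_i - x_bar)^2
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)

beta_hat = s_xy / s_xx                 # slope estimate
alpha_hat = y_bar - beta_hat * x_bar   # intercept estimate

print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
```

The deviation form is used because it avoids the large raw sums of the first formula; both give identical estimates.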

Under the normal error assumption \varepsilon_i \sim N(0, \sigma^2):

\hat{\beta} \sim N\left(\beta, \frac{\sigma^2}{S_{xx}}\right)

where S_{xx} = \sum (x_i - \bar{x})^2.

When \sigma^2 is unknown, estimate it using:

s^2 = \frac{ESS}{n - 2}

A confidence interval for the true slope \beta is:

\hat{\beta} \pm t_{\alpha/2,\, n-2} \times \frac{s}{\sqrt{S_{xx}}}
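Continuing the same illustrative fit, a sketch of the 95% interval for the slope; the critical value t_{0.025, 3} is taken from standard t-tables:

```python
# 95% confidence interval for the slope (same illustrative data as above).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.3, 6.2, 8.1, 9.9]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# s^2 = ESS / (n - 2): residual variance around the fitted line
ess = sum((y - (alpha_hat + beta_hat * x)) ** 2 for x, y in zip(xs, ys))
s = (ess / (n - 2)) ** 0.5

t_crit = 3.182  # t_{0.025, 3} from t-tables (95% level, n - 2 = 3 df)
half_width = t_crit * s / s_xx ** 0.5
ci = (beta_hat - half_width, beta_hat + half_width)
print(f"95% CI for beta: ({ci[0]:.3f}, {ci[1]:.3f})")
```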

Hypothesis Testing on the Regression Coefficient


To test whether X significantly predicts Y:

H_0: \beta = 0 \quad \text{and} \quad H_1: \beta \neq 0

Test statistic:

t = \frac{\hat{\beta} - 0}{s / \sqrt{S_{xx}}} \sim t_{n-2}

If |t| > \text{critical value}, reject H_0, which means the relationship between X and Y is statistically significant.
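This test can be sketched on the same illustrative data; t is compared against the two-sided 5% critical value:

```python
# t-test of H0: beta = 0 (same illustrative data as above).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.3, 6.2, 8.1, 9.9]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
alpha_hat = y_bar - beta_hat * x_bar

ess = sum((y - (alpha_hat + beta_hat * x)) ** 2 for x, y in zip(xs, ys))
s = (ess / (n - 2)) ** 0.5

t_stat = (beta_hat - 0) / (s / s_xx ** 0.5)
t_crit = 3.182  # t_{0.025, 3}: two-sided 5% critical value for 3 df
reject_h0 = abs(t_stat) > t_crit
print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

With this data the slope is many standard errors from zero, so H_0 is rejected.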

Analysis of Variance (ANOVA) for Regression


Tests whether the regression line fits the data well.

| Source of Variation | Sum of Squares (SS) | df | Mean Square (MS) |
| --- | --- | --- | --- |
| Regression (RSS) | \sum_{i=1}^n (\hat{y_i} - \bar{y})^2 | 1 | \text{RSS} / 1 |
| Error (ESS) | \sum_{i=1}^n (y_i - \hat{y_i})^2 | n - 2 | \text{ESS} / (n - 2) |
| Total (TSS) | \sum_{i=1}^n (y_i - \bar{y})^2 | n - 1 | |

Here:

  • y_i: actual observed value of the dependent variable for observation i
  • \hat{y_i}: predicted value of y_i from the regression line
  • \bar{y}: mean of all observed y_i values (overall average)

If RSS is large relative to ESS, the model fits well; this is assessed with the F-ratio:

F_\text{calc} = \frac{\text{RSS}/1}{\text{ESS}/(n-2)} \sim F_{1,\, n-2}

H_0: \text{regression line does not fit the data}

H_1: \text{regression line fits the data}

Reject H_0 if F_{\text{calc}} > F_{1,\, n-2,\, \alpha}.
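The table can be assembled in Python (same illustrative data as the earlier sketches); the decomposition TSS = RSS + ESS also yields R^2 = RSS / TSS:

```python
# ANOVA decomposition and F-ratio for the fitted line (illustrative data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.3, 6.2, 8.1, 9.9]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
alpha_hat = y_bar - beta_hat * x_bar
y_hat = [alpha_hat + beta_hat * x for x in xs]

rss = sum((yh - y_bar) ** 2 for yh in y_hat)          # regression SS, df = 1
ess = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # error SS, df = n - 2
tss = sum((y - y_bar) ** 2 for y in ys)               # total SS, df = n - 1

f_calc = (rss / 1) / (ess / (n - 2))
r_squared = rss / tss  # coefficient of determination
print(f"F = {f_calc:.1f}, R^2 = {r_squared:.4f}")
```

Note that F_\text{calc} here equals the square of the t-statistic from the slope test, as expected when there is a single explanatory variable.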