Statistical learning
Statistical Learning
Terminology
- Input: predictors, independent variables, features, variables
- Output: responses, dependent variables
Prediction
- Predict \(\hat Y = \hat f(X)\)
- The accuracy of \(\hat Y\) depends on the reducible error and irreducible error
- Reducible error: The difference between \(\hat f\) and true \(f\)
- Irreducible error: Y also depends on \(\epsilon\), which cannot be predicted from X (unmeasured variables, inherent noise)
Inference
- Which predictors are associated with Y?
- What is the relationship between Y and each predictor?
- Is a linear equation adequate to represent the relationship between Y and predictors?
How to estimate f
- Parametric: Makes an assumption about the functional form and uses training data to fit the model; estimating a set of parameters is easier, but \(\hat f\) may be far from the true \(f\).
- Non-parametric: No assumption about the functional form; estimates an \(f\) that is close to the data points without being too rough or wiggly; requires a very large number of observations.
Accuracy vs Interpretability (vs Flexibility)
- More flexible = (maybe) more accurate = less interpretable
Assess model accuracy
- No Free Lunch theorem: no method dominates all others over all possible data sets.
- Measuring the Quality of Fit: MSE for regression, error rate for classification.
- Training vs testing error: as flexibility increases, training error decreases monotonically while testing error is U-shaped; training error is typically smaller because it is what we directly optimize.
- Overfitting vs underfitting: overfitting = low training error but high testing error (right side of the U); underfitting = both training and testing error are high.
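A minimal sketch of the training/test error behavior, assuming scikit-learn is available; the data set and the degree grid are made up purely for illustration:

```python
# Sketch: training vs. test MSE as flexibility (polynomial degree) increases.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)   # nonlinear f plus irreducible noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in [1, 3, 10, 20]:                             # increasing flexibility
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(poly.transform(X_tr)))
    test_mse = mean_squared_error(y_te, model.predict(poly.transform(X_te)))
    # training MSE tends to keep falling; test MSE is typically U-shaped
    print(degree, round(train_mse, 3), round(test_mse, 3))
```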
Bias-Variance Trade-off
\[\mathbb E[(y_0 - \hat f(x_0))^2] = \text{Var}[\hat f(x_0)] + \text{Bias}(\hat f(x_0))^2 + \text{Var}[\epsilon]\]- Remark: Expected MSE cannot lie below \(Var[\epsilon]\)
- Remark: We want to achieve simultaneously low bias and low variance
- Variance: How much \(\hat f\) would change if it were estimated on a different training set
- Bias: Error introduced by approximating a real-life problem
- Remark: Highly flexible = high variance = low bias
- Remark: The rate at which bias and variance change as flexibility increases determines whether test MSE goes up or down
The Bayes classifier
- Assigns each observation to most likely class given its predictor values
- Bayes error rate: \(1 - \max_j \Pr(Y = j \|X = x_0)\), analogous to irreducible error
- Gold standard to compare other methods
- K-nearest neighbor: \(\Pr(Y = j \| X= x_0)\) is unknown, so estimate it from the K nearest training points \(N_0\) and classify to the class with the largest estimate
\[\Pr(Y = j \| X = x_0 ) \approx \frac{1}{K}\sum_{i\in N_0} I(y_i = j)\]- Remark: Often close to the optimal Bayes classifier
- Remark: Selecting K matters (bias-variance trade-off)
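A small sketch of the effect of K, assuming scikit-learn; the two-dimensional data set is synthetic:

```python
# Sketch: KNN classification; small K is flexible (low bias, high variance), large K is smooth.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] > 0).astype(int)      # nonlinear decision boundary
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

for k in [1, 10, 100]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, round(1 - knn.score(X_te, y_te), 3)) # test error rate for each K
```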
Linear Regression
Question
- Is there a relationship between response and predictors?
- Is the relationship linear?
- How strong is the relationship?
- Is there an interaction (synergy) effect among predictors?
- How accurate is our prediction?
Formulation
\[Y = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p + \epsilon\]- Assumptions: \(\epsilon \perp\!\!\!\!\perp X\), \(\epsilon_i \perp\!\!\!\!\perp \epsilon_j\); \(\epsilon\) is a catch-all for what we miss (nonlinear relationships, missing predictors, measurement error)
- Residual: \(e_i = y_i - \hat y_i\)
- Objective: \(\min RSS = e_1^2 + ... + e_n^2\)
- Estimating coefficients: Least squares solution
- Interpretation:
- \(\hat\beta_0\): Expected value of Y when all predictors are zero
- \(\hat\beta_j\): Average increase in Y for a one-unit increase in \(X_j\), holding the other predictors fixed
Simple Linear Regression
- Number of predictors: \(p = 1\)
- Assessing the model:
- Residual standard error (RSE): an absolute measure of lack of fit (in units of Y); an estimate of the standard deviation of \(\epsilon\)
\[\hat\sigma = RSE = \sqrt{\frac{RSS}{n-p-1}}\]
- \(R^2\)-statistic: a relative measure of fit, the proportion of variability in Y explained by the regression
\[0 \le R^2 = 1 - \frac{RSS}{TSS} \le 1\]- RSS: amount of variability left unexplained after performing the regression
- TSS: total variance in the response Y, the amount of variability before the regression (predicting with only \(\bar y\))
- Interpretational advantage over RSE, but what counts as a good \(R^2\) depends on the application
- Remark: \(R^2 = r^2\) (the squared sample correlation) in simple linear regression
- Others: confidence intervals, hypothesis testing, p-values (a numeric sketch of RSE and \(R^2\) follows)
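A numeric sketch of RSE and \(R^2\) from the definitions above, using plain numpy on simulated data:

```python
# Sketch: compute RSE and R^2 from the residuals of a least squares fit.
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 1
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=1.5, size=n)

X = np.column_stack([np.ones(n), x])              # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares coefficients
resid = y - X @ beta_hat

rss = np.sum(resid ** 2)
tss = np.sum((y - y.mean()) ** 2)
rse = np.sqrt(rss / (n - p - 1))                  # estimate of sd(epsilon), should be near 1.5
r2 = 1 - rss / tss                                # proportion of variability explained
print(round(rse, 3), round(r2, 3))
```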
Multiple Linear Regression
- Number of predictor \(p > 1\)
- Is there a relationship between the response and the predictors? Hypothesis testing (F-test)
- Considers all predictors jointly
- If the linear model assumptions hold: \(E[\frac{RSS}{n-p-1}] = \sigma^2\)
- Under \(H_0\) (all \(\beta_j = 0\)): \(E[\frac{TSS-RSS}{p}] = \sigma^2\), so \(F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)}\) is near 1 under \(H_0\) and much larger otherwise
- Remark: Tests all predictors together, instead of individually (a numeric F-statistic sketch appears at the end of this section)
- Which predictors are important?
- Criteria: Mallow’s \(C_p\), AIC, BIC, Adjusted-\(R^2\)
- Procedure: Forward selection, backward selection, or mixed
- How good model fits?: RSE and \(R^2\)-statistics
- Remark: RSE depends on the relative change between p and RSS, so it can increase when a weak predictor is added
- Remark: Adding variables always increases \(R^2\) on the training data (risk of overfitting)
- How accurate is the prediction?
- Reducible error: Inaccuracy between \(f(X)\) and \(\hat Y\); model bias
- Confidence interval: interval for the true \(f(X)\); uncertainty about the average response at a given X
- Irreducible error: Inaccuracy between \(Y\) and \(\hat Y\) due to the random error \(\epsilon\)
- Prediction interval: interval for an individual Y at a particular X; always wider than the confidence interval since it also accounts for \(\epsilon\)
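A numeric sketch of the F-statistic for \(H_0: \beta_1 = ... = \beta_p = 0\), using numpy on simulated data:

```python
# Sketch: F-statistic for the overall relationship in multiple regression.
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([0.5, 0.0, -0.3]) + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
rss = np.sum((y - Xd @ beta_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

# compare with an F(p, n-p-1) distribution; values far above 1 suggest a relationship
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
print(round(f_stat, 2))
```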
Qualitative predictors
- Dummy variable: incorporating qualitative variable into regression analysis
- Coding scheme 0/1 (One-hot encoding):
- \(\beta_0\): Average Y when X = 0
- \(\beta_1\): The average difference in Y between X = 1 and X = 0
- Coding scheme -1/1:
- \(\beta_0\): Average Y (ignore X)
- \(\beta_1\): Amount by which each level of X has Y above or below the overall average
- Remark: Coding scheme does not affect the fit
- More than 2 levels: Always one fewer dummy variable than the number of levels
- The level with no dummy variable is the baseline (see the sketch below)
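A small sketch of dummy coding, assuming pandas; the variable names (region, balance) are invented for illustration:

```python
# Sketch: one-hot (0/1) dummy coding; the dropped level acts as the baseline.
import pandas as pd

df = pd.DataFrame({"region": ["East", "West", "South", "East", "South"],
                   "balance": [500, 720, 310, 640, 280]})

# drop_first=True keeps (levels - 1) dummies; the dropped level ("East") is the baseline
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
X = pd.concat([dummies, df[["balance"]]], axis=1)
print(X)
```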
Extension of linear model
- Remove additive assumption:
- Interaction term \(X_j X_k\): the effect on Y of increasing \(X_j\) depends on the value of \(X_k\) (a sketch follows this list)
- Hierarchical principle: If we include the interaction term, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
- Nonlinear relationship: Polynomial regression
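A small sketch of fitting an interaction term alongside the main effects, assuming scikit-learn; the data are simulated so the true coefficients are known:

```python
# Sketch: interaction term X1*X2 added to the main effects (hierarchical principle).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))
y = 1 + 2 * X[:, 0] + 0.5 * X[:, 1] + 1.5 * X[:, 0] * X[:, 1] + rng.normal(size=300)

# interaction_only=True adds X1*X2 but no squared terms; main effects are kept
X_int = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(X)
model = LinearRegression().fit(X_int, y)
print(model.coef_)   # roughly [2, 0.5, 1.5]: two main effects plus the interaction
```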
Potential Problems
- Nonlinear between response and predictors
- Residual plot: look for discernible patterns
- Solution: transformation on X
- Correlation among error terms
- Correlated errors make the estimated standard errors underestimate the true ones (e.g., accidentally duplicating every observation shrinks the standard errors without adding information)
- Independence of the errors is very important for linear regression inference (a common violation: time series data)
- Non-constant variance of error term
- Residual plot: Funnel shape in residual plot
- Solution: transformation on Y
- Outliers: Point where \(y_i\) is unusual given \(x_i\)
- Remark: An outlier may have little effect on the least squares fit itself, but it can still inflate the RSE and distort \(R^2\) and p-values
- Studentized residuals: Detect outliers
- High-leverage points: Point where \(x_i\) is unusual
- Remark: Identifying (and possibly removing) high-leverage points is important, since they can strongly influence the fit
- Remark: A point can be both an outlier and a high-leverage point; such points are especially dangerous
- Leverage statistic \(h_i\): to detect high-leverage points
- \(h_i\) lies between \(\frac{1}{n}\) and 1, and its average value always equals \(\frac{p+1}{n}\)
- Collinearity: Two or more variables are closely related to each other
- Difficult to separate out the individual effects
- Makes \(\hat\beta_j\) inaccurate and increases \(SE[\hat\beta_j]\), which reduces the power of the hypothesis test
- Correlation matrix: Good for pairs of variables
- Multicollinearity: Variance inflation factor
- Solution: Drop one of the collinear variables, or combine them into a single predictor (a VIF sketch follows this list)
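A small sketch of the variance inflation factor, computed from its definition (regress each predictor on the others), assuming scikit-learn:

```python
# Sketch: VIF_j = 1 / (1 - R^2 of X_j regressed on the other predictors).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=200)   # deliberately collinear with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    print(f"VIF_{j + 1} = {1 / (1 - r2):.2f}")    # values above roughly 5-10 signal collinearity
```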
Comparison with KNN
- Parametric vs non-parametric
- Linearity assumption: LR
- Non-linearity in low dimensions: KNN
- Non-linearity in high dimensions: LR (KNN suffers from the curse of dimensionality)
- Few observations per predictor: LR
Logistic Regression
Why not Linear Regression
- Coding a qualitative response numerically implies an ordering of the outcomes.
- Linear regression output is hard to interpret as a probability (it can fall outside [0, 1]), and the approach does not extend naturally to more than 2 classes.
Logistic Model
- Logistic function: \(p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\)
- Odds: \(\frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta_1 X}\)
- Log odds (logit): \(\log \frac{p(X)}{1-p(X)} = \beta_0 + \beta_1 X\)
- \(\beta_1\): Increasing X by 1 changes the log odds by \(\beta_1\) (multiplies the odds by \(e^{\beta_1}\)).
Estimating coefficients
- Maximum likelihood: choose \(\hat\beta_0, \hat\beta_1\) to maximize \(\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} (1 - p(x_{i'}))\)
- Predictions: Possible to have qualitative predictors (using dummy variables)
Multiple Logistic Regression
\[p(Y=1\|X) = p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + ... + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + ... + \beta_p X_p}}\]
Multinomial Logistic Regression
- Baseline encoding: model the log odds of each of the K-1 classes relative to a baseline class
- Remark: The choice of baseline changes the coefficients, but the fitted values (and the log odds between any pair of classes) stay the same.
- Softmax encoding: treats all K classes symmetrically, with no baseline (a fitting sketch follows)
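A small sketch of fitting a logistic model, assuming scikit-learn; note that scikit-learn's LogisticRegression applies mild L2 regularization by default, so it is not exactly the plain maximum likelihood fit:

```python
# Sketch: logistic regression; coefficients are read as log-odds changes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 2))
logit = -1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)   # a unit increase in X_j shifts the log odds by about coef_[0][j]
print(clf.predict_proba(X[:3]))    # fitted probabilities p(X) for the first three observations
```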
Generative Models for classification
Motivation
- When preferred over logistic regression: substantial separation between the classes (logistic regression estimates become unstable), small sample size with approximately normal predictors, or more than 2 classes.
- Challenge: estimating \(f_k(x) \rightarrow\) simplifying assumption
Linear Discriminant Analysis p = 1
- Assumption: \(f_k(x) \sim N(\mu_k, \sigma_k^2)\) and \(\sigma_1^2 = ... = \sigma_K^2 = \sigma^2\)
- Remark: The discriminant function \(\delta_k(x)\) is linear in x
- Challenge: \(\mu_k\), \(\pi_k\), and \(\sigma^2\) are unknown, so they are estimated from the training data and plugged in
Linear Discriminant Analysis p > 1
- Assumption: \(f_k(x)\sim N(\mu_k, \Sigma)\)
- Decision boundary: \(\delta_k(x) = \delta_l(x)\)
- Remark: Prefer when more than 2 classes (view data in low dimension)
Metric
- Confusion matrix
- Sensitivity: fraction of actual positives correctly identified, TP / (TP + FN)
- Specificity: fraction of actual negatives correctly identified, TN / (TN + FP)
- The Bayes classifier minimizes the total error rate, so it can perform poorly on the rarer class (low sensitivity or specificity for that class)
- Threshold (in K = 2) affects specificity and sensitivity \(\rightarrow\) domain knowledge
- ROC curve: Overall performance of the classifier, all possible threshold, given by area under curve (AUC)
- False positive rate = 1 - Specificity = Type I error
- True positive rate = Sensitivity = Recall = 1 - Type II error = Power
- Precision: among positive predictions, the fraction that are actually positive, TP / (TP + FP) (a metrics sketch follows this list)
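A small sketch of these metrics on a toy set of labels and predicted probabilities (the numbers are made up), assuming scikit-learn:

```python
# Sketch: confusion matrix, sensitivity/specificity/precision, and AUC.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
p_hat = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.6, 0.55])

y_pred = (p_hat >= 0.5).astype(int)        # moving the threshold trades sensitivity for specificity
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity (TPR):", tp / (tp + fn))
print("specificity (TNR):", tn / (tn + fp))
print("precision:", tp / (tp + fp))
print("AUC:", roc_auc_score(y_true, p_hat))  # summarizes performance over all thresholds
```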
Quadratic Discriminant Analysis
- Assumption: \(f_k(x)\sim N(\mu_k, \Sigma_k)\)
- Challenge: High variance \(\rightarrow\) work better with more observations (bias-variance trade-off)
Naive Bayes
- Assumption: Among kth class, p predictors are independent, \(f_k(x) = f_{k1}(x_1) \times... \times f_{kp}(x_p)\)
- Remark: Works well when n < p (reduces variance)
- If \(X_j\) is quantitative, \(f_{kj}(x_j) \sim N(\mu_{kj}, \sigma^2_{kj})\) (Similar to QDA with diagonal covariance matrix) or nonparametric (kernel density estimator)
- If \(X_j\) is qualitative, use the proportion of training observations in the kth class that fall into each category of the jth predictor
Comparison of all methods
- NB takes the form of a generalized additive model (additive in the log odds)
- LDA is a special case of QDA
- Any classifier with a linear decision boundary is a special case of NB with \(g_{kj}(x_j) = b_{kj}x_j\)
- NB when \(X_j\) is quantitative and \(f_{kj}(x_j) \sim N(\mu_{kj}, \sigma^2_{kj})\) are special case of QDA (\(\Sigma\) is a diagonal matrix)
- LDA tends to outperform Logistic Regression when the normality assumption (approximately) holds; Logistic Regression tends to win when it does not.
- KNN is better when the decision boundary is highly non-linear and n is much larger than p. If the boundary is moderately non-linear and n is limited, QDA is preferred (a comparison sketch follows).
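A small sketch comparing the classifiers on one simulated data set (two Gaussian classes, so LDA's assumptions roughly hold), assuming scikit-learn:

```python
# Sketch: LDA, QDA, Gaussian naive Bayes, and logistic regression on the same data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 600
y = rng.binomial(1, 0.5, size=n)
X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, [1.5, 1.0], [0.0, 0.0])  # class-shifted means
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("QDA", QuadraticDiscriminantAnalysis()),
                  ("NB", GaussianNB()),
                  ("Logistic", LogisticRegression())]:
    print(name, round(clf.fit(X_tr, y_tr).score(X_te, y_te), 3))   # test accuracy
```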
Generalized Linear Model
- Y belongs to Exponential Family Distribution
- Poisson Regression: \(\lambda(X_1,...,X_p) = e^{\beta_0 + \beta_1X_1 + ... + \beta_pX_p}\) and use MLE to find \(\hat\beta\)
- Interpretation: Increasing \(X_j\) by one unit multiplies \(\mathbb E[Y]\) by \(\exp(\beta_j)\) (changes the log mean by \(\beta_j\))
- Mean variance relationship: \(Var[Y] = \mathbb E[Y] = \lambda\)
- Nonnegative fitted values
- Link function: Transform mean of the response so that the transformed mean is a linear function of predictors
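A small sketch of Poisson regression with a log link, assuming the statsmodels package; the data are simulated with known coefficients:

```python
# Sketch: Poisson regression fit by maximum likelihood (GLM with a log link).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 2))
lam = np.exp(0.3 + 0.5 * X[:, 0] - 0.2 * X[:, 1])   # log E[Y] is linear in the predictors
y = rng.poisson(lam)

model = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()
print(model.params)   # a one-unit increase in X_j multiplies E[Y] by exp(beta_j)
```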
Resampling Methods
- Model selection: Select proper level of flexibility
- Model assessment: Estimate the test error
Cross-validation
Validation Set approach
- Randomly divide the data into a training set and a validation set
- Challenge: tends to overestimate the test error rate (high bias, since only part of the data is used for training) and the estimate is highly variable across splits (high variance)
LOOCV
- Leave one observation out as the validation set and use the rest as the training set; repeat n times
\[CV_{(n)} = \frac{1}{n}\sum_{i=1}^n MSE_{i}\] - Benefits: approximately unbiased estimate of the test error; no randomness from the train/validation split
- Challenge: expensive (n fits)
- Remark: For least squares linear or polynomial regression, a shortcut gives LOOCV at the cost of a single fit: \(CV_{(n)} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat y_i}{1 - h_i}\right)^2\), where \(h_i\) is the leverage
k-fold CV
- Randomly divide the dataset into k groups (folds); each fold serves once as the validation set
\[CV_{(k)} = \frac{1}{k}\sum_{i=1}^k MSE_i\] - LOOCV is the special case k = n
- Benefits: less expensive than LOOCV, and often a more accurate estimate of the test error rate
- Challenge (vs. LOOCV): more bias (each fit uses fewer training observations) but less variance (the k fits are less correlated); k = 5 or 10 is a common compromise
- Classification: the same idea with the error rate in place of MSE, though the CV estimate can still under- or overestimate the true test error (a sketch follows)
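A small sketch of the CV estimate of the test MSE for k = 5, k = 10, and LOOCV, assuming scikit-learn; the data are simulated:

```python
# Sketch: k-fold cross-validation estimates of the test MSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut

rng = np.random.default_rng(9)
X = rng.uniform(-2, 2, size=(150, 1))
y = 1 + 2 * X.ravel() + rng.normal(size=150)

for cv in [5, 10, LeaveOneOut()]:                 # k = 5, k = 10, and LOOCV (k = n)
    scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    print(round(-scores.mean(), 3))               # CV estimate of the test MSE
```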
Bootstrap
- Apply when difficult to obtain a measure of variability.
- Problem: We cannot draw new samples from the original population to assess variability.
- Solution: Repeatedly sample observations, with replacement, from the original dataset; compute the estimate on each bootstrap sample and use the spread of these estimates as the standard error.
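A small sketch of the bootstrap standard error for a statistic with no simple SE formula (the median here), using numpy only; B and the data are arbitrary choices:

```python
# Sketch: bootstrap standard error via repeated sampling with replacement.
import numpy as np

rng = np.random.default_rng(10)
data = rng.exponential(scale=2.0, size=100)       # the original sample

B = 1000
boot_stats = np.empty(B)
for b in range(B):
    resample = rng.choice(data, size=data.size, replace=True)  # sample WITH replacement
    boot_stats[b] = np.median(resample)

print("bootstrap SE of the median:", round(boot_stats.std(ddof=1), 4))
```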
Regularization
Motivation
- Prediction accuracy: when n is not much larger than p (or p > n), least squares has high variance; constraining or shrinking the coefficients reduces it.
- Model interpretability: Irrelevant variables leads to unnecessary complexity, which makes harder to interpret.
Subset selection
- Best Subset selection: \(2^p\) models (including the null model)
- For each k = 1,…,p: fit all \(\binom{p}{k}\) models containing exactly k predictors \(\rightarrow\) pick the one with the smallest RSS (or highest \(R^2\), or smallest deviance), call it \(M_k\) \(\rightarrow\) choose among \(M_0,\dots,M_p\) using cross-validation error, \(C_p\), AIC, BIC, or adjusted-\(R^2\)
- Challenge: expensive
Stepwise selection
- Motivation: Large search space \(\rightarrow\) overfitting and high variance in the estimate
- Forward:
- Start with the null model \(M_0\); for k = 0,…,p-1: fit the \(p - k\) models that add one predictor to \(M_k\) \(\rightarrow\) pick the best (smallest RSS or highest \(R^2\)) as \(M_{k+1}\) \(\rightarrow\) choose among \(M_0,\dots,M_p\) (cross-validation error, \(C_p\), BIC, or adjusted-\(R^2\))
- Does well in practice
- Not guaranteed to find the best model
- Possible to apply when n < p (up to \(M_{n-1}\))
- Backward:
- Start with the full model \(M_p\); for k = p, p-1, …, 1: fit the k models that drop one predictor from \(M_k\) \(\rightarrow\) pick the best (smallest RSS or highest \(R^2\)) as \(M_{k-1}\) \(\rightarrow\) choose among \(M_0,\dots,M_p\) (cross-validation error, \(C_p\), BIC, or adjusted-\(R^2\))
- Requires n > p
- Hybrid: Combination of two previous ones
Choosing the optimal model
- Indirectly estimate testing error: adjust training error
- Assumption about true underlying model
- \(C_p\): \(\frac{1}{n}(RSS + 2d\hat\sigma^2)\)
- \(\hat\sigma^2\): estimate of the variance of \(\epsilon\) (from the full model)
- More features (d) will increase \(C_p\)
- If \(\hat\sigma^2\) is an unbiased estimate of \(\sigma^2\), \(C_p\) is an unbiased estimate of the test MSE
- \(AIC\): \(\frac{1}{n}(RSS + 2d\hat\sigma^2)\)
- Class of models fit by MLE
- Proportional to \(C_p\)
- \(BIC\): \(\frac{1}{n}(RSS + \log(n)d\hat\sigma^2)\)
- Bayesian POV
- Heavier penalty for large model
- Adjusted-\(R^2\): \(1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}\)
- Adding more features reduces RSS but also reduces \(n-d-1\); the relative change determines whether adjusted-\(R^2\) improves
- Not as well motivated in statistical theory as the other criteria
- Larger values are better (opposite of the three criteria above)
- Directly estimate testing error: validation set approach or cross-validation
- Fewer assumption about true underlying model
- Preferred when it is hard to pinpoint the degrees of freedom or to estimate \(\sigma^2\)
Shrinkage
- Ridge regression: \(\min_\beta RSS + \lambda \sum_{j=1}^p \beta_j^2\)
- \(\lambda\): shrinkage penalty, shrink \(\beta_j\) toward zero
- Remark: shrinkage penalty is not applied to \(\beta_0\) (measure of the mean value of the response)
- Remark: As \(\lambda\) increases, individual coefficients may occasionally increase, even though the coefficients shrink toward zero as a whole
- Remark: LS solution is scale equivariant, while RR coeff is not.
- Remark: Standardizing features before applying ridge regression
- Bias-variance trade-off: Increase bias, reduce variance (flexibility)
- Remark: Shrink coefficients towards, but not exactly, zero
- Equivalent constrained form: \(\min_\beta \|Y - X\beta\|^2\) subject to \(\sum_{j=1}^p \beta_j^2 \le s\)
- Remark: Bayesian view: a normal prior on \(\beta\) gives ridge regression as the posterior mode (which equals the posterior mean)
- LASSO:
- Model interpretation/Variable selection: Forcing some coefficients to be exactly zero
- Sparse model: \(\lambda\) is sufficiently large
- \(\min_\beta \|Y - X\beta\|^2\) subject to \(\sum_{j=1}^p \|\beta_j\| \le s\)
- Closely related to Best Subset selection: \(\min_\beta \|Y - X\beta\|^2\) subject to \(\sum_{j=1}^p I(\beta_j\not = 0) \le s\)
- Remark: Corner solution
- Remark: Neither LASSO nor Ridge regression universally dominates the other
- Remark: Since the derivative of the absolute value does not exist at 0, the solution uses soft thresholding (which sets some coefficients exactly to 0)
- Remark: Bayesian view: a Laplace prior on \(\beta\) gives the lasso as the posterior mode (not the posterior mean); a ridge/lasso sketch follows
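A small sketch of ridge vs. lasso on standardized predictors, assuming scikit-learn; the data are simulated so only the first two coefficients are nonzero:

```python
# Sketch: ridge shrinks coefficients toward zero; lasso sets some exactly to zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(11)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

Xs = StandardScaler().fit_transform(X)            # ridge/lasso are not scale equivariant
for lam in [0.1, 1.0, 10.0]:
    ridge = Ridge(alpha=lam).fit(Xs, y)
    lasso = Lasso(alpha=lam).fit(Xs, y)
    print(lam, np.round(ridge.coef_[:4], 2), np.round(lasso.coef_[:4], 2))
```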
Dimension reduction: To be written in the near future
Consideration in high dimension: To be written in the near future
Beyond Linearity
Polynomial regression
\[y_i = \beta_0 + \beta_1 x_i + ... +\beta_d x_i^d +\epsilon\]- Extremely non-linear curve
- Remark: d < 5, otherwise overly flexible and strange shapes
- Variance of the fit at \(x_0\): the pointwise standard error \(\sqrt{\widehat{\text{Var}}[\hat f(x_0)]}\), computed from the estimated covariance matrix of the coefficients
- Applicable for linear and logistic regression
Step functions:
\[y_i = \beta_0 + \beta_1 C_1(x_i) + ... + \beta_K C_K(x_i) + \epsilon\]- Global structure on the non-linear function
- Continuous variable \(\rightarrow\) ordered categorical variable
- \(C_k(x) = I(c_k \le x < c_{k+1})\): indicator (dummy) variables for the bins
- \(\beta_0\): Mean value of Y for \(X < c_1\)
- \(\beta_j\): Average increase in Y for \(c_j \le X < c_{j+1}\) relative to \(X < c_1\)
- Remark: Unless there are natural breakpoints, step functions can miss the action (the trend within each bin)
Basis functions:
\[y_i = \beta_0 + \beta_1 b_1(x_i) + ... + \beta_K b_K(x_i) + \epsilon\]- \(b_k\): fixed and known functions (step functions and polynomial regression are special cases)
- Can be fit with ordinary least squares, so standard errors and tests carry over
Regression splines:
- Piecewise polynomials (e.g., piecewise cubic): \(y_i = \beta_{01} + \beta_{11}x_i + \beta_{21}x_i^2 + \beta_{31}x_i^3 + \epsilon\) if \(x_i < c\), otherwise \(y_i = \beta_{02} + \beta_{12} x_i + \beta_{22}x_i^2 + \beta_{32}x_i^3 + \epsilon\)
- Knots: c, more knots = more flexible
- Remark: Discontinuous (too flexible)
- Remark: A piecewise cubic with 1 knot fits 2 polynomials \(\times\) 4 parameters = 8 parameters in total
- Constraints: the function and its derivatives up to order d-1 (the first and second derivatives for a cubic) must be continuous at each knot
- Remark: Each constraint reduces the degrees of freedom by one
- Remark: A cubic spline with K knots has K + 4 degrees of freedom (\(4(K+1)\) parameters minus \(3K\) constraints)
- Spline Basis Representation:
- Truncated Power basis (cubic spline): \(h(x, \xi) = (x - \xi)_+^3\) with \(\xi\) is the location of the knot
- Remark: Only the third derivative is discontinuous at \(\xi\); the function and its first and second derivatives remain continuous
- Remark: Higher variance at outer range
- Natural spline: Additional linearity boundary constraints
- Choosing the number and locations of the knots
- Remark: More knots = more flexible; place more knots where the function changes most rapidly
- In practice, knots are often placed in a uniform fashion (e.g., at quantiles of X), with the number chosen by cross-validation
- Comparison with Polynomial Regression
- Splines introduce flexibility through knots while keeping the degree fixed, rather than through higher powers, which gives more stable estimates, especially near the boundaries (a spline-basis sketch follows)
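A small sketch of a cubic regression spline built from the truncated power basis described above, using numpy only; the knot locations are arbitrary:

```python
# Sketch: cubic spline via the truncated power basis h(x, xi) = (x - xi)_+^3.
import numpy as np

rng = np.random.default_rng(12)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)

knots = [2.5, 5.0, 7.5]                                      # K = 3 knots -> K + 4 = 7 basis columns
basis = [np.ones_like(x), x, x ** 2, x ** 3]
basis += [np.clip(x - xi, 0, None) ** 3 for xi in knots]     # truncated power terms
B = np.column_stack(basis)

beta, *_ = np.linalg.lstsq(B, y, rcond=None)                 # least squares on the spline basis
fitted = B @ beta
print(round(np.mean((y - fitted) ** 2), 3))                  # training MSE of the spline fit
```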
Smoothing splines:
\[\hat g = \arg\min_g \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2\,dt\]
- Remark: Without the penalty, g could be made flexible enough to interpolate all \(y_i\) (RSS = 0)
- Roughness: the second derivative measures how fast the first derivative changes
- Remark: The integral measures the total change in the first derivative over the range
- \(\lambda\): as \(\lambda \rightarrow \infty\), g becomes very smooth
- Remark: The solution is a shrunken version of a natural cubic spline
- Remark: As \(\lambda\) increases from 0 to \(\infty\), the effective degrees of freedom \(df_{\lambda}\) decrease from n to 2
- Effective degrees of freedom: measures the flexibility of the smoothing spline
Local regression:
- Compute the fit at target point using nearby training observations
- \(K_{i0}\): the weight given to each training point, different for each target value \(x_0\)
- Memory-based: similar to KNN (needs the training data at prediction time)
- s: the span, the proportion of points used in each local regression, playing a role similar to \(\lambda\)
- Remark: smaller s = more wiggly fit
- Poor performance in high dimension (Curse of dimensionality)
Generalized additive models: To be written in the future
Tree-based
- Basics of Decision Trees
- Regression Trees
- Stratification of Feature Space
- Tree Pruning
- Classification Trees
- Trees vs Linear models
- Advantages and Disadvantages
- Bagging, RF, Boosting, Bayesian Additive Regression Trees
- Bagging
- Out-of-Bag error estimation
- Variable Importance measures
- RF
- Boosting
- Bayesian Additive Regression Trees: To be added in the future
SVM
- Maximal Margin Classifier
- Hyperplane
- Classification using separating hyperplane
- The classifier
- Construction of the classifier
- Non-separable cases
- Support Vector Classifier
- Overview
- Detail
- Support Vector Machines
- Non-linear Decision Boundaries
- More than 2 classes
- One vs one
- One vs all
- Relationship to logistic regression