Statistical learning
Statistical Learning
Terminology
 Input: predictors, independent variables, features, variables
 Output: responses, dependent variables
Prediction
 Predict \(\hat Y = \hat f(X)\)
 The accuracy of \(\hat Y\) depends on the reducible error and irreducible error
 Reducible error: The difference between \(\hat f\) and the true \(f\); can be reduced with a better estimate
 Irreducible error: \(Y\) also depends on \(\epsilon\), which is unmeasurable and cannot be predicted from \(X\)
Inference
 Which predictors are associated with Y?
 What is the relationship between Y and each predictor?
 Is a linear equation adequate to represent the relationship between Y and predictors?
How to estimate f
 Parametric: Makes an assumption about the functional form and uses training data to fit the model; estimating a fixed set of parameters is easier, but \(\hat f\) may be far from the true \(f\).
 Nonparametric: No assumption about the functional form; estimates an \(f\) that gets close to the data points without being too rough or wiggly. Requires a very large number of observations.
Accuracy vs Interpretability (vs Flexibility)
 More flexible = (maybe) more accurate = less interpretable
Assess model accuracy
 No Free Lunch theorem: no method dominates all others over all possible data sets.
 Measuring the Quality of Fit: MSE for regression, error rate for classification.
 Training vs Testing error: Training error decreases monotonically with flexibility while test error is U-shaped; training error is always smaller since it is what we directly optimize.
 Overfitting vs Underfitting: Overfitting = low training error but high test error (the right side of the U-shaped test error curve); underfitting = both training and test errors are high.
Bias-Variance Tradeoff
\[\mathbb E[(y_0 - \hat f(x_0))^2] = \text{Var}[\hat f(x_0)] + \text{Bias}(\hat f(x_0))^2 + \text{Var}[\epsilon]\] Remark: Expected test MSE cannot lie below \(\text{Var}[\epsilon]\)
 Remark: We want to achieve simultaneously low bias and low variance
 Variance: How much \(\hat f\) would change if we estimated it using a different training set
 Bias: Error introduced by approximating a real-life problem with a (much simpler) model
 Remark: More flexible = higher variance, lower bias
 Remark: The relative rates at which bias and variance change determine where the test MSE is minimized; the simulation sketch below makes the decomposition concrete
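A minimal simulation sketch (not from the source; the true \(f\), noise level, and the cubic fit are assumptions for illustration) showing the decomposition numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin          # assumed true regression function
sigma = 0.5         # sd of the irreducible noise epsilon
x0 = 1.0            # test point

preds = []
for _ in range(2000):                     # many training sets
    x = rng.uniform(0, 3, 50)
    y = f(x) + rng.normal(0, sigma, 50)
    coef = np.polyfit(x, y, deg=3)        # fit a cubic polynomial
    preds.append(np.polyval(coef, x0))    # prediction at x0

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2       # squared bias at x0
var = preds.var()                         # variance of f_hat(x0)
print(bias2 + var + sigma**2)             # approx expected test MSE at x0
```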
The Bayes classifier
 Assigns each observation to the most likely class, given its predictor values
 Bayes error rate: \(1 - \max_j \Pr(Y = j \mid X = x_0)\), analogous to the irreducible error
 Gold standard to compare other methods

K-nearest neighbors: \(\Pr(Y = j \mid X = x_0)\) is unknown, so estimate it from the K nearest training points \(\mathcal N_0\)
\[\hat{\Pr}(Y = j \mid X = x_0 ) = \frac{1}{K}\sum_{i\in \mathcal N_0} I(y_i = j)\] Remark: Can get close to the optimal Bayes classifier
 Remark: Selecting K matters (bias-variance tradeoff)
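A minimal numpy sketch of the estimate above (names such as `X_train` are placeholders, not from the source):

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=5):
    """Estimate Pr(Y = j | X = x0) by averaging over the k nearest neighbors."""
    dist = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances to x0
    N0 = np.argsort(dist)[:k]                     # indices of the k nearest points
    classes = np.unique(y_train)
    probs = np.array([(y_train[N0] == j).mean() for j in classes])
    return classes[np.argmax(probs)], probs       # predicted class, estimates
```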
Linear Regression
Question
 Is there a relationship between response and predictors?
 Is the relationship linear?
 How strong is the relationship?
 Is there an interaction (synergy) effect among predictors?
 How accurate is our prediction?
Formulation
\[Y = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p + \epsilon\] Assumptions: \(\epsilon \perp\!\!\!\!\perp X\), \(\epsilon_i \perp\!\!\!\!\perp \epsilon_j\); \(\epsilon\) is a catch-all for what we miss (nonlinear relationships, missing predictors, measurement error)
 Residual: \(e_i = y_i - \hat y_i\)
 Objective: \(\min RSS = e_1^2 + ... + e_n^2\)
 Estimating coefficients: Least squares solution
 Interpretation:
 \(\hat\beta_0\): Average value of Y when all predictors are zero
 \(\hat\beta_j\): Average increase in Y for a one-unit increase in \(X_j\), holding the other predictors fixed (see the sketch below)
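A minimal sketch of the least squares solution with numpy (data simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 2
X = rng.normal(size=(n, p))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)   # true betas: 3, 2, -1

Xd = np.column_stack([np.ones(n), X])                # prepend intercept column
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)    # minimizes RSS
print(beta_hat)                                      # approx [3, 2, -1]
```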
Simple Linear Regression

Number of predictors \(p = 1\)

Assessing model:

Residual standard error: an absolute measure of lack of fit; an estimate of the standard deviation of \(\epsilon\)
\[\hat\sigma = RSE = \sqrt{\frac{RSS}{n-p-1}}\] 
\(R^2\) statistic: a relative (unit-free) measure of fit, the proportion of variability explained by the regression
\[0 \le R^2 = 1 - \frac{RSS}{TSS} \le 1\] RSS: amount of variability left unexplained after performing the regression
 TSS: total variance in the response Y, the amount of variability before regressing (using only \(\bar y\))
 Interpretational advantage over RSE, but what counts as a good \(R^2\) depends on the application
 Remark: In simple linear regression, \(R^2 = r^2\) (the squared correlation); a computational sketch follows below

Others: confidence interval, hypothesis testing, p-value
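A sketch computing RSE and \(R^2\) from a fit (self-contained; data simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 1
x = rng.uniform(0, 10, n)
y = 1 + 0.5 * x + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
RSS = np.sum((y - X @ beta) ** 2)
TSS = np.sum((y - y.mean()) ** 2)
RSE = np.sqrt(RSS / (n - p - 1))   # estimate of sd(epsilon), here approx 1
R2 = 1 - RSS / TSS                 # proportion of variability explained
print(RSE, R2)
```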

Multiple Linear Regression
 Number of predictors \(p > 1\)
 Is there a relationship between response and predictors? Hypothesis testing (F-test)
 All predictors: \(F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)}\)
 If the linear model assumptions hold: \(E[\frac{RSS}{n-p-1}] = \sigma^2\)
 Under \(H_0\) (no relationship): \(E[\frac{TSS - RSS}{p}] = \sigma^2\), so \(F \approx 1\); a computational sketch follows this section
 Remark: Tests all coefficients together, instead of individually
 Which predictors are important?
 Criteria: Mallow’s \(C_p\), AIC, BIC, Adjusted\(R^2\)
 Procedure: Forward selection, backward selection, or mixed
 How well does the model fit? RSE and \(R^2\) statistics
 Remark: RSE can increase when a variable is added, if the drop in RSS is small relative to the increase in p
 Remark: Adding more variables always increases \(R^2\) (risk of overfitting)
 How accurate is the prediction?
 Reducible error: Inaccuracy between \(f(X)\) and \(\hat Y\), i.e., model bias
 Confidence interval: contains the true value of \(f(X)\); uncertainty around the average Y over many X
 Irreducible error: Inaccuracy between \(Y\) and \(\hat Y\) due to the random error \(\epsilon\)
 Prediction interval: contains the true value of Y; uncertainty around Y for a particular X (always wider than the confidence interval)
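The overall F-test mentioned above, as a computational sketch (scipy for the p-value; data simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 1 + X @ np.array([0.5, 0.0, -0.3]) + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
RSS = np.sum((y - Xd @ beta) ** 2)
TSS = np.sum((y - y.mean()) ** 2)
F = ((TSS - RSS) / p) / (RSS / (n - p - 1))     # F-statistic for H0
p_value = stats.f.sf(F, p, n - p - 1)           # tail of the F(p, n-p-1) law
print(F, p_value)
```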
Qualitative response variable
 Dummy variable: incorporating qualitative variable into regression analysis
 Coding scheme 0/1 (one-hot encoding):
 \(\beta_0\): Average Y when X = 0
 \(\beta_1\): The difference in average Y between X = 1 and X = 0
 Coding scheme -1/+1:
 \(\beta_0\): Average Y (ignoring X)
 \(\beta_1\): Amount by which each level of X is above or below the overall average
 Remark: Coding scheme does not affect the fit
 More than 2 levels: Always one fewer dummy variable than the number of levels
 No dummy variable = baseline level (see the dummy-coding sketch below)
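A pandas sketch of 0/1 dummy coding with a baseline level (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"region": ["East", "West", "South", "East"]})
# drop_first=True keeps one fewer dummy than levels; the dropped level
# ("East") becomes the baseline absorbed by the intercept.
dummies = pd.get_dummies(df["region"], drop_first=True)
print(dummies)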
Extension of linear model
 Remove additive assumption:
 Interaction term \(X_j X_k\): the increase in Y associated with a one-unit increase in \(X_j\) depends on the value of \(X_k\) (see the sketch after this list)
 Hierarchical principle: If we include an interaction term, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
 Nonlinear relationship: Polynomial regression
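A sketch of adding an interaction term by hand (data simulated; the X1*X2 column carries the synergy effect):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X1, X2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + X1 + 2 * X2 + 1.5 * X1 * X2 + rng.normal(size=n)

# main effects plus the interaction column X1*X2 (hierarchical principle)
Xd = np.column_stack([np.ones(n), X1, X2, X1 * X2])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
print(beta)   # last entry estimates the interaction coefficient (approx 1.5)
```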
Potential Problems
 Nonlinear relationship between response and predictors
 Residual plot: Observe any discernible patterns
 Solution: transformation on X
 Correlation among error terms
 Underestimates the true standard errors (e.g., accidentally doubling the observations would understate the SEs)
 This assumption is very important to linear regression
 Nonconstant variance of error term
 Residual plot: Funnel shape in residual plot
 Solution: transformation on Y
 Outliers: Point where \(y_i\) is unusual given \(x_i\)
 Remark: Outliers may have little effect on the least squares fit, but can still distort RSE and \(R^2\)
 Studentized residuals: Detect outliers
 Highleverage points: Point where \(x_i\) is unusual
 Remark: Identifying high-leverage points is important, since they can strongly influence the fit
 Remark: A point can be a high-leverage point, an outlier, or both; points that are both are especially dangerous
 Leverage statistics: To detect high-leverage points (see the sketch after this list)
 Between \(\frac{1}{n}\) and 1, and the average value ALWAYS equals \(\frac{p+1}{n}\)
 Collinearity: Two or more variables are closely related to each other
 Difficult to separate out the individual effects
 Inaccurate \(\hat\beta_j\): \(SE[\hat\beta_j]\) increases, so the power of the hypothesis test is reduced
 Correlation matrix: Good for pairs of variables
 Multicollinearity: Variance inflation factor
 Solution: Drop one of the collinear variables, or combine them into a single predictor
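A diagnostics sketch for the leverage statistic \(h_i\) (numpy only; data simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 3
X = rng.normal(size=(n, p))
Xd = np.column_stack([np.ones(n), X])

H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T    # hat matrix
h = np.diag(H)                              # leverage of each observation
# each h_i lies between 1/n and 1, and their mean is exactly (p+1)/n
print(h.min(), h.max(), h.mean(), (p + 1) / n)
```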
Comparison with KNN
 Parametric vs nonparametric
 Linearity assumption: LR
 Nonlinearity in low dimension: LR
 Nonlinearity in high dimension: KNN
 Few observations: LR (KNN suffers from the curse of dimensionality)
Logistic Regression
Why not Linear Regression
 Coding scheme: implies an ordering of the outcomes.
 Linear Regression: Hard to interpret outputs as probabilities (they can fall outside [0, 1]), and impossible for more than 2 classes
Logistic Model
 Logistic function: \(p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\)
 Odds: \(\frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta_1 X}\)
 Log odds (logit): \(\log \frac{p(X)}{1-p(X)} = \beta_0 + \beta_1 X\)
 \(\beta_1\): Increasing X by 1 changes the log odds by \(\beta_1\) (multiplies the odds by \(e^{\beta_1}\))
Estimating coefficients
 Maximum likelihood: choose \(\hat\beta_0, \hat\beta_1\) to maximize \(\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} (1 - p(x_{i'}))\)
 Predictions: Qualitative predictors are possible (using dummy variables); a fitting sketch follows below
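A fitting sketch with sklearn (data simulated); the coefficient is read on the log-odds scale:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 1))
p_true = 1 / (1 + np.exp(-(-0.5 + 2 * X[:, 0])))   # true beta_1 = 2
y = rng.binomial(1, p_true)

clf = LogisticRegression().fit(X, y)
b1 = clf.coef_[0, 0]
print(b1, np.exp(b1))   # one-unit increase in X multiplies the odds by e^b1
```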
Multiple Logistic Regression
\[\Pr(Y=1\mid X) = p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + ... + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + ... + \beta_p X_p}}\]
Multinomial Logistic Regression
 Baseline encoding:

Remark: The choice of baseline will change the coefficients, but the fitted value will be the same.

Softmax encoding: No baseline
Generative Models for classification
Motivation
 Conditions favoring generative models: substantial separation between the 2 classes (where logistic regression is unstable), small sample size, predictors approximately normal in each class.
 Challenge: estimating \(f_k(x)\) \(\rightarrow\) make simplifying assumptions
Linear Discriminant Analysis p = 1
 Assumption: \(f_k(x) \sim N(\mu_k, \sigma_k^2)\) and \(\sigma_1^2 = ... = \sigma_K^2 = \sigma^2\)
 Remark: Linear in terms of x
 Challenge: \(\mu_k\) and \(\sigma^2\) are unknown, so estimate them from the training data
Linear Discriminant Analysis p > 1
 Assumption: \(f_k(x)\sim N(\mu_k, \Sigma)\)
 Decision boundary: \(\delta_k(x) = \delta_l(x)\)
 Remark: Preferred when there are more than 2 classes (also allows viewing the data in low dimensions)
Metric
 Confusion matrix
 Sensitivity: Percentage of actual positives correctly identified (TP rate)
 Specificity: Percentage of actual negatives correctly identified (TN rate)
 The Bayes classifier can do poorly on specificity, since it minimizes the total error rate
 Threshold (when K = 2) trades off specificity against sensitivity \(\rightarrow\) choose it with domain knowledge
 ROC curve: Overall performance of the classifier over all possible thresholds, summarized by the area under the curve (AUC)
 False positive rate = 1 - Specificity = Type I error rate
 True positive rate = Sensitivity = Recall = 1 - Type II error rate = Power
 Precision: Among positive predictions, the fraction that are actually true positives (a metrics sketch follows below)
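A metrics sketch with sklearn (labels and scores below are illustrative only):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.6, 0.55])
y_pred = (scores >= 0.5).astype(int)           # one particular threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)    # true positive rate (recall, power)
specificity = tn / (tn + fp)    # true negative rate
auc = roc_auc_score(y_true, scores)   # area under the ROC curve
print(sensitivity, specificity, auc)
```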
Quadratic Discriminant Analysis
 Assumption: \(f_k(x)\sim N(\mu_k, \Sigma_k)\)
 Challenge: High variance \(\rightarrow\) works better with more observations (bias-variance tradeoff)
Naive Bayes
 Assumption: Within the kth class, the p predictors are independent: \(f_k(x) = f_{k1}(x_1) \times... \times f_{kp}(x_p)\)
 Remark: Works well when n is small relative to p (the independence assumption reduces variance)
 If \(X_j\) is quantitative: \(f_{kj}(x_j) \sim N(\mu_{kj}, \sigma^2_{kj})\) (similar to QDA with a diagonal covariance matrix), or nonparametric (kernel density estimator)
 If \(X_j\) is qualitative: count the proportion of training observations for the jth predictor corresponding to each class (a comparison sketch of LDA/QDA/NB follows below)
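A comparison sketch of LDA, QDA, and Gaussian naive Bayes with sklearn (two simulated Gaussian classes with shared covariance, the LDA setting):

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(7)
cov = [[1.0, 0.5], [0.5, 1.0]]
X0 = rng.multivariate_normal([0, 0], cov, 100)
X1 = rng.multivariate_normal([2, 2], cov, 100)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 100)

for clf in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis(), GaussianNB()):
    print(type(clf).__name__, clf.fit(X, y).score(X, y))   # training accuracy
```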
Comparison of all methods
 NB takes the form of a generalized additive model
 LDA is a special case of QDA
 Any classifier with a linear decision boundary is a special case of NB with \(g_{kj}(x_j) = b_{kj}x_j\)
 NB with quantitative \(X_j\) and \(f_{kj}(x_j) \sim N(\mu_{kj}, \sigma^2_{kj})\) is a special case of QDA (with diagonal \(\Sigma_k\))
 LDA will be better than Logistic Regression when the Normal assumption (approximately) holds, and worse when it does not.
 KNN is better when the decision boundary is highly nonlinear and \(n \gg p\). If the relationship is nonlinear but n is only modest relative to p, QDA is preferred.
Generalized Linear Model
 Y belongs to the exponential family of distributions
 Poisson Regression: \(\lambda(X_1,...,X_p) = e^{\beta_0 + \beta_1X_1 + ... + \beta_pX_p}\) and use MLE to find \(\hat\beta\)
 Interpretation: Increasing \(X_j\) by one unit multiplies \(\mathbb E[Y]\) by \(\exp(\beta_j)\)
 Mean-variance relationship: \(Var[Y] = \mathbb E[Y] = \lambda\)
 Nonnegative fitted values
 Link function: Transforms the mean of the response so that the transformed mean is a linear function of the predictors (a Poisson sketch follows below)
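A Poisson regression sketch with statsmodels (log link; data simulated for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.uniform(0, 2, 200)
lam = np.exp(0.3 + 0.8 * x)               # lambda(X) = exp(b0 + b1*x)
y = rng.poisson(lam)

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()   # MLE
print(fit.params, np.exp(fit.params[1]))   # one unit of x multiplies E[Y] by e^b1
```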
Sampling Method
 Model selection: Select proper level of flexibility
 Model assessment: Estimate the test error
Crossvalidation
Validation Set approach
 Randomly divide the data into a training and a validation set
 Challenge: Overestimates the test error rate and the estimate is highly variable across splits
 High bias (overestimation, since only part of the data is used for training), high variance
LOOCV

Leave one observation out as the validation set and use the rest as the training set; repeat n times
\[CV_{(n)} = \frac{1}{n}\sum_{i=1}^n MSE_{i}\]  Benefits: Approximately unbiased estimate of the test error rate; no randomness from the train/validation split.
 Challenge: expensive
 Remark: Much less expensive for least squares or polynomial regression, where the exact shortcut \(CV_{(n)} = \frac{1}{n}\sum_{i=1}^n \big(\frac{y_i - \hat y_i}{1 - h_i}\big)^2\) (with leverage \(h_i\)) needs only a single fit
kfold CV

Randomly divide the dataset into k groups (folds)
\[CV_{(k)} = \frac{1}{k}\sum_{i=1}^k MSE_i\]  LOOCV is a special case (k = n)
 Benefits: Less expensive; often a more accurate estimate of the test error rate
 Challenge: Compared with LOOCV, more bias (each training set is smaller) but less variance (LOOCV's n fits are highly correlated)
 Classification: Similar idea, but can sometimes underestimate the test error rate (a CV sketch follows below)
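A CV sketch with sklearn (5-fold here; setting n_splits = n would give LOOCV):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(9)
X = rng.uniform(0, 10, (100, 1))
y = 1 + 0.5 * X[:, 0] + rng.normal(0, 1, 100)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=cv)
print(-scores.mean())   # CV_(k): average MSE over the k held-out folds
```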
Bootstrap
 Apply when difficult to obtain a measure of variability.
 Problem: New samples cannot be generated from the original population.
 Solution: Repeatedly sample observations (with replacement) from the original dataset \(\rightarrow\) recompute the estimate on each bootstrap sample and use the spread of these estimates to obtain the standard error of the estimator.
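A bootstrap sketch for the standard error of a statistic with no closed-form SE (here the median; numpy only, data simulated):

```python
import numpy as np

rng = np.random.default_rng(10)
data = rng.exponential(scale=2.0, size=100)

B = 1000
boot = np.empty(B)
for b in range(B):
    resample = rng.choice(data, size=data.size, replace=True)  # with replacement
    boot[b] = np.median(resample)

print(np.median(data), boot.std(ddof=1))   # estimate and its bootstrap SE
```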
Regularization
Motivation
 Prediction accuracy: When n is not much larger than p, the LS fit has high variance; constraining the coefficients reduces it
 Model interpretability: Irrelevant variables lead to unnecessary complexity, which makes the model harder to interpret.
Subset selection
 Best Subset selection: \(2^p\) models
 For each k (1,…, p) predictors \(\rightarrow\) Fit all \(\binom{p}{k}\) models containing exactly k predictors \(\rightarrow\) Choose the best model (smallest RSS, highest \(R^2\), or deviance), called \(M_k\) \(\rightarrow\) Choose the best model among \(M_0,…,M_p\) (cross-validation error, \(C_p\), BIC, or adjusted \(R^2\))
 Challenge: expensive
Stepwise selection
 Motivation: Large search space \(\rightarrow\) overfitting and high variance in the estimate
 Forward:
 Start from the null model \(M_0\); for each k (0,…, p-1) \(\rightarrow\) Fit the \(p - k\) models that add one predictor to \(M_k\) \(\rightarrow\) Choose the best (smallest RSS, highest \(R^2\), or deviance) as \(M_{k+1}\) \(\rightarrow\) Choose the best model among \(M_0,…,M_p\) (cross-validation error, \(C_p\), BIC, or adjusted \(R^2\))
 Does well in practice
 Not guaranteed to find the best model
 Possible to apply when n < p
 Backward:
 Start from the full model \(M_p\); for each k (p, p-1, …, 1) \(\rightarrow\) Fit the \(k\) models that each drop one predictor from \(M_k\) \(\rightarrow\) Choose the best (smallest RSS, highest \(R^2\), or deviance) as \(M_{k-1}\) \(\rightarrow\) Choose the best model among \(M_0,…,M_p\) as above
 Requires n > p
 Hybrid: Combination of the two previous approaches (a forward-selection sketch follows below)
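A forward-selection sketch via sklearn's SequentialFeatureSelector (data simulated; only X0 and X2 matter):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(11)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=200)

sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2,
                                direction="forward", cv=5).fit(X, y)
print(sfs.get_support())   # boolean mask of the selected predictors
```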
Choosing the optimal model
 Indirectly estimate testing error: adjust training error
 Assumption about true underlying model
 \(C_p\): \(\frac{1}{n}(RSS + 2d\hat\sigma^2)\)
 \(\hat\sigma^2\): estimate of \(Var[\epsilon]\), computed from the full model
 More features (d) increase the penalty, so \(C_p\) rises unless RSS drops enough
 \(E[C_p]\) = Test MSE (unbiased, if \(\hat\sigma^2\) is unbiased)
 \(AIC\): \(\frac{1}{n}(RSS + 2d\hat\sigma^2)\)
 Class of models fit by MLE
 Proportional to \(C_p\)
 \(BIC\): \(\frac{1}{n}(RSS + \log(n)d\hat\sigma^2)\)
 Bayesian POV
 Heavier penalty for large model
 Adjusted \(R^2\): \(1-\frac{RSS/(n-d-1)}{TSS/(n-1)}\)
 Adding more features reduces RSS but also reduces (n-d-1); the relative change matters.
 Not well motivated in statistical theory
 Larger value is better (opposite of the three criteria above)
 Directly estimate testing error: validation set approach or crossvalidation
 Fewer assumption about true underlying model
 Preferred when it is hard to pinpoint the degrees of freedom or to estimate \(\sigma^2\) (a criteria sketch follows below)
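A sketch computing the criteria above from RSS (using the formulas as given here; \(\hat\sigma^2\) would come from the full model's residuals):

```python
import numpy as np

def model_criteria(RSS, TSS, n, d, sigma2_hat):
    """C_p, BIC, and adjusted R^2 for a least squares model with d predictors.
    For least squares, AIC is proportional to C_p and is omitted."""
    Cp = (RSS + 2 * d * sigma2_hat) / n
    BIC = (RSS + np.log(n) * d * sigma2_hat) / n
    adj_R2 = 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))
    return Cp, BIC, adj_R2
```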
Shrinkage
 Ridge regression: \(\min_\beta RSS + \lambda\sum_{j=1}^p \beta_j^2\)
 \(\lambda\): shrinkage penalty, shrink \(\beta_j\) toward zero
 Remark: shrinkage penalty is not applied to \(\beta_0\) (measure of the mean value of the response)
 Remark: As \(\lambda\) increases, individual coefficients may occasionally increase (though the overall shrinkage grows)
 Remark: The LS solution is scale equivariant, while the ridge coefficients are not
 Remark: Standardize the features before applying ridge regression
 Bias-variance tradeoff: Increases bias, reduces variance (less flexibility)
 Remark: Shrink coefficients towards, but not exactly, zero
 \(\min_\beta \|Y - X\beta\|^2\) subject to \(\sum_{j=1}^p \beta_j^2 \le s\)

Remark: Equivalent to a Normal prior on \(\beta\), with the ridge solution as the posterior mode (= posterior mean)
 LASSO:
 Model interpretation/Variable selection: Forcing some coefficients to be exactly zero
 Sparse model: \(\lambda\) is sufficiently large
 \(\min_\beta \|Y - X\beta\|^2\) subject to \(\sum_{j=1}^p |\beta_j| \le s\)
 Closely related to Best Subset selection: \(\min_\beta \|Y - X\beta\|^2\) subject to \(\sum_{j=1}^p I(\beta_j\not = 0) \le s\)
 Remark: Corner solution
 Remark: Neither LASSO nor Ridge regression universally dominates the other
 Remark: Since the derivative of the absolute value does not exist at 0, the solution uses soft thresholding (which sets coefficients exactly to 0)
 Remark: Equivalent to a Laplace prior on \(\beta\), with the lasso solution as the posterior mode (not the mean); a ridge-vs-lasso sketch follows below
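A ridge-vs-lasso sketch with sklearn (features standardized first, since neither is scale equivariant; penalty strengths are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + rng.normal(size=200)      # only the first predictor matters

Xs = StandardScaler().fit_transform(X)
print(Ridge(alpha=10.0).fit(Xs, y).coef_)   # all nonzero, shrunk toward zero
print(Lasso(alpha=0.5).fit(Xs, y).coef_)    # sparse: some exactly zero
```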
Dimension reduction: To be written in the near future
Consideration in high dimension: To be written in the near future
Beyond Linearity
Polynomial regression
\[y_i = \beta_0 + \beta_1 x_i + ... +\beta_d x_i^d +\epsilon_i\] Can produce extremely nonlinear curves
 Remark: Keep d < 5; otherwise the fit becomes overly flexible and can take strange shapes
 Variance of the fit at \(x_0\): pointwise square root of the variance estimate of \(\hat f(x_0)\)
 Applicable for linear and logistic regression
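A degree-4 sketch with numpy's polyfit (data simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(13)
x = rng.uniform(-3, 3, 200)
y = x**3 - 2 * x + rng.normal(0, 1, 200)

coef = np.polyfit(x, y, deg=4)       # least squares on 1, x, ..., x^4
print(np.polyval(coef, 0.5))         # fitted value at x = 0.5
```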
Step functions:
\[y_i = \beta_0 + \beta_1 C_1(x_i) + ... + \beta_K C_K(x_i) + \epsilon_i\] Avoids imposing a global structure on the nonlinear function
 Continuous variable \(\rightarrow\) ordered categorical variable
 \(C_K(x) = I(c_K \le x < c_{K+1})\): indicator function, dummy variable
 \(\beta_0\): Mean value of Y for \(X < c_1\)
 \(\beta_j\): Average increase in Y for X in \(c_j \le X < c_{j+1}\) relative to \(X < c_1\)
 Remark: Unless there are natural breakpoints, step functions can miss the action (a sketch follows below)
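A step function sketch using pd.cut to build the dummies (bins are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(14)
x = rng.uniform(0, 10, 200)
y = np.where(x < 5, 1.0, 3.0) + rng.normal(0, 0.5, 200)

bins = pd.cut(x, bins=[0, 2.5, 5, 7.5, 10])        # ordered categories
D = pd.get_dummies(bins, drop_first=True)          # C_1(x), ..., C_K(x)
Xd = np.column_stack([np.ones(len(x)), D.to_numpy(dtype=float)])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
print(beta)   # beta_0 = mean Y in the first bin; others are offsets from it
```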
Basis functions:
\[y_i = \beta_0 + \beta_1 b_1(x_i) + ... + \beta_K b_K(x_i) + \epsilon_i\] \(b_k\): fixed and known functions (step and polynomial regression are special cases)
 Applicable for OLS
Regression splines:
 Piecewise Polynomials: \(y_i = \beta_{01} + \beta_{11}x_i + \beta_{21}x_i^2 + \beta_{31}x_i^3 + \epsilon_i\) if \(x_i < c\), otherwise \(y_i = \beta_{02} + \beta_{12} x_i + \beta_{22}x_i^2 + \beta_{32}x_i^3 + \epsilon_i\)
 Knots: c, more knots = more flexible
 Remark: Discontinuous (too flexible)
 Remark: 1 knot and 4 parameters \(\rightarrow\) 8 parameters in total
 Constraints: Derivatives up to order d - 1 must be continuous (for a cubic: the function, first, and second derivatives)
 Remark: Every constraint frees one degree of freedom
 Remark: Cubic splines have K (knots) + 4 (\(\beta_0, \beta_1, \beta_2, \beta_3\)) degrees of freedom
 Spline Basis Representation:
 Truncated Power basis (cubic spline): \(h(x, \xi) = (x  \xi)_+^3\) with \(\xi\) is the location of the knot
 Remark: Discontinuity in third derivative
 Remark: Higher variance at outer range
 Natural spline: Additional boundary constraints (linear beyond the outermost knots) \(\rightarrow\) lower variance at the boundaries
 Choosing the number and locations of the knots
 Remark: More knots = more flexible; place more knots where the function changes most rapidly
 Uniform fashion: In practice, place knots at uniform quantiles of the data
 Compare with Polynomial Regression:
 Splines introduce flexibility through knots rather than higher polynomial degree \(\rightarrow\) more stable, especially near the boundaries (a spline-fitting sketch follows below)
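A cubic regression spline sketch via sklearn's SplineTransformer feeding a linear model (knot count and placement are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(15)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.3, 200)

model = make_pipeline(
    SplineTransformer(degree=3, n_knots=6, knots="uniform"),  # cubic spline basis
    LinearRegression(),
).fit(x.reshape(-1, 1), y)
print(model.predict(np.array([[5.0]])))   # fitted value near sin(5)
```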
Smoothing splines:
\[\hat g = \arg\min_g \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2dt\]
 Remark: Without the penalty, g is VERY flexible (interpolates all \(y_i\), making RSS = 0)
 Roughness: Second derivative = how fast the first derivative changes
 Remark: The integral = total change in the first derivative over the range
 \(\lambda\): As \(\lambda \rightarrow \infty\), g becomes very smooth
 Remark: The solution is a shrunken version of a natural cubic spline
 Remark: As \(\lambda\) increases \(0 \rightarrow \infty\), \(df_{\lambda}\) decreases \(n\rightarrow 2\)
 Effective degrees of freedom: Measures the flexibility of the spline
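A smoothing sketch via scipy's UnivariateSpline; its `s` parameter caps the residual sum of squares, a role similar in spirit (though not identical) to \(\lambda\):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(16)
x = np.sort(rng.uniform(0, 10, 200))     # strictly increasing w.p. 1
y = np.sin(x) + rng.normal(0, 0.3, 200)

spl = UnivariateSpline(x, y, s=200 * 0.3**2)   # larger s = smoother g
print(spl(5.0))
```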
Local regression:
 Compute the fit at target point using nearby training observations
 \(K_{i0}\): Weights, different for each value of \(x_0\)
 Memory-based: Similar to KNN (needs the training data to compute each fit)
 s: Span, the proportion of points used to compute the local regression, playing a role similar to \(\lambda\)
 Remark: Smaller s = more wiggly fit (a lowess sketch follows below)
 Poor performance in high dimension (Curse of dimensionality)
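A local regression (lowess) sketch with statsmodels; `frac` plays the role of the span s:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(17)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, 200)

fitted = lowess(y, x, frac=0.3)   # columns: sorted x, locally fitted values
print(fitted[:3])                 # smaller frac = more wiggly fit
```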
Generalized additive models: To be written in the future
Treebased
 Basics of Decision Trees
 Regression Trees
 Stratification of Feature Space
 Tree Pruning
 Classification Trees
 Trees vs Linear models
 Advantages and Disadvantages
 Bagging, RF, Boosting, Bayesian Additive Regression Trees
 Bagging
 Out-of-Bag error estimation
 Variable Importance measures
 RF
 Boosting
 Bayesian Additive Regression Trees: To be added in the future
SVM
 Maximal Margin Classifier
 Hyperplane
 Classification using separating hyperplane
 The classifier
 Construction of the classifier
 Non-separable cases
 Support Vector Classifier
 Overview
 Detail
 Support Vector Machines
 Nonlinear Decision Boundaries
 More than 2 classes
 One vs one
 One vs all
 Relationship to logistic regression