Statistical Learning

Terminology

  • Input: predictors, independent variables, features, variables
  • Output: responses, dependent variables

Prediction

  • Predict \(\hat Y = \hat f(X)\)
  • The accuracy of \(\hat Y\) depends on the reducible error and irreducible error
  • Reducible error: The difference between \(\hat f\) and true \(f\)
  • Irreducible error: \(Y\) also depends on \(\epsilon\), which cannot be predicted from \(X\) (unmeasured or unmeasurable variation)
\[\mathbb E[(Y - \hat Y)^2] = (f(X) - \hat f(X))^2 + \text{Var}[\epsilon]\]

Inference

  • Which predictors are associated with Y?
  • What is the relationship between Y and each predictor?
  • Is a linear equation adequate to represent the relationship between Y and predictors?

How to estimate f

  • Parametric: makes an assumption about the functional form and uses training data to fit the model; estimating a fixed set of parameters is easier, but \(\hat f\) may be far from the true \(f\).
  • Non-parametric: no assumption about the functional form; estimates an \(f\) that gets close to the data points without being too rough or wiggly, but requires a very large number of observations.

Accuracy vs Interpretability (vs Flexibility)

  • More flexible = (maybe) more accurate = less interpretable

Assess model accuracy

  • No Free Lunch theorem: no method dominates all others over all possible data sets.
  • Measuring the Quality of Fit: MSE for regression, error rate for classification.
  • Training vs testing error: as flexibility increases, training error decreases monotonically while testing error is U-shaped; training error is always smaller since we optimize it directly.
  • Overfitting vs underfitting: overfitting shows up as the rising side of the U-shaped testing error (low training error, high testing error); underfitting means both training and testing error are high.

Bias-Variance Trade-off

\[\mathbb E[(y_0 - \hat f(x_0))^2] = \text{Var}[\hat f(x_0)] + \text{Bias}(\hat f(x_0))^2 + \text{Var}[\epsilon]\]
  • Remark: Expected MSE cannot lie below \(Var[\epsilon]\)
  • Remark: We want to achieve simultaneously low bias and low variance
  • Variance: How much \(\hat f\) would change if we estimated it on a different training data set
  • Bias: Error introduced by approximating a real-life problem
    • Remark: More flexible = higher variance = lower bias
    • Remark: The rates at which bias and variance change with flexibility determine the shape of the test MSE
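
A minimal simulation sketch of this decomposition, using numpy only; the true function \(f\), the noise level, and the polynomial degrees below are made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)          # hypothetical true regression function
sigma, n, x0 = 0.3, 50, 0.8          # noise sd, training size, evaluation point

def fit_and_predict(degree):
    """Fit a degree-d polynomial on a fresh training set, predict at x0."""
    x = rng.uniform(0, 2, n)
    y = f(x) + rng.normal(0, sigma, n)
    coefs = np.polyfit(x, y, degree)
    return np.polyval(coefs, x0)

for degree in (1, 3, 9):
    preds = np.array([fit_and_predict(degree) for _ in range(2000)])
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    # Expected test MSE at x0 = bias^2 + variance + sigma^2 (irreducible)
    print(f"degree {degree}: bias^2={bias2:.4f}  var={var:.4f}  "
          f"mse approx {bias2 + var + sigma**2:.4f}")
```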

The Bayes classifier

  • Assigns each observation to most likely class given its predictor values
  • Bayes error rate: \(1 - \max_j \Pr(Y = j \mid X = x_0)\), analogous to the irreducible error
    • Gold standard to compare other methods
  • K-nearest neighbors: \(\Pr(Y = j \mid X = x_0)\) is unknown, so estimate it from the \(K\) nearest training points \(N_0\):

    \[\Pr(Y = j \mid X = x_0) \approx \frac{1}{K}\sum_{i\in N_0} I(y_i = j)\]
    • Remark: Close to optimal Bayes classifier
    • Remark: Selecting K matters (bias-variance trade-off)
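
A small sketch of the KNN estimate above on simulated two-class data, using numpy only; the class means, \(K\), and the query point are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated 2-class training data (class 1 shifted to the upper right)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(1.5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def knn_predict(x0, K=5):
    """Estimate Pr(Y = j | X = x0) by the class fractions among the K nearest points."""
    dist = np.linalg.norm(X - x0, axis=1)
    neighbors = y[np.argsort(dist)[:K]]
    probs = np.bincount(neighbors, minlength=2) / K
    return probs, probs.argmax()      # estimated probabilities, predicted class

print(knn_predict(np.array([0.7, 0.7]), K=5))
```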

Linear Regression

Question

  • Is there a relationship between response and predictors?
  • Is the relationship linear?
  • How strong is the relationship?
  • Is there an interaction (synergy) effect among predictors?
  • How accurate are our predictions?

Formulation

\[Y = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p + \epsilon\]
  • Assumptions: \(\epsilon \perp\!\!\!\!\perp X\), \(\epsilon_i \perp\!\!\!\!\perp \epsilon_j\); \(\epsilon\) is a catch-all for everything we miss (nonlinear relationships, missing predictors, measurement error)
  • Residual: \(e_i = y_i - \hat y_i\)
  • Objective: \(\min RSS = e_1^2 + ... + e_n^2\)
  • Estimating coefficients: Least squares solution
  • Interpretation:
    • \(\hat\beta_0\): Expected value of Y when all predictors are zero
    • \(\hat\beta_j\): Average increase in Y for a one-unit increase in \(X_j\), holding the other predictors fixed

Simple Linear Regression

  • Number of predictors: \(p = 1\)

  • Assessing model:

    • Residual standard error: measure lack of fit, an estimate of standard deviation of \(\epsilon\)

      \[\hat\sigma = RSE = \sqrt{\frac{RSS}{n-p-1}}\]
    • \(R^2\)-statistic: a unit-free, relative measure of fit, the proportion of variability explained by the regression

      \[0 \le R^2 = 1 - \frac{RSS}{TSS} \le 1\]
      • RSS: amount of variability left after performing regression
      • TSS: total variance in the response Y, i.e., the variability before the regression (predicting with \(\bar y\) alone)
      • Interpretational advantage over RSE, but what counts as a good \(R^2\) depends on the application
      • Remark: In simple linear regression, \(R^2 = r^2\) (squared correlation)
    • Others: confidence interval, hypothesis testing, p-value
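
A quick numeric check of RSE, \(R^2\), and the \(R^2 = r^2\) remark on simulated data, using numpy only; the coefficients and noise level are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 10, n)
y = 3 + 2 * x + rng.normal(0, 1.5, n)    # hypothetical data with known truth

# Least squares estimates for p = 1
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
rse = np.sqrt(rss / (n - 2))             # n - p - 1 with p = 1
r2 = 1 - rss / tss
r = np.corrcoef(x, y)[0, 1]
print(f"RSE={rse:.3f}  R^2={r2:.3f}  r^2={r**2:.3f}")   # R^2 equals r^2 here
```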

Multiple Linear Regression

  • Number of predictors: \(p > 1\)
  • Is there a relationship between the response and the predictors?: Hypothesis test (F-test) on all predictors at once
    • \(H_0: \beta_1 = \dots = \beta_p = 0\) vs \(H_a\): at least one \(\beta_j \neq 0\); test statistic \(F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)}\)
    • If the linear model is correct, \(E[\frac{RSS}{n-p-1}] = \sigma^2\); under \(H_0\), \(E[\frac{TSS-RSS}{p}] = \sigma^2\) as well, so \(F \approx 1\) when \(H_0\) holds and \(F \gg 1\) provides evidence against it
    • Remark: Test all predictors together, instead of individually
  • Which predictors are important?
    • Criteria: Mallow’s \(C_p\), AIC, BIC, Adjusted-\(R^2\)
    • Procedure: Forward selection, backward selection, or mixed
  • How well does the model fit?: RSE and \(R^2\)-statistic
    • Remark: RSE \(= \sqrt{RSS/(n-p-1)}\) can increase when a variable is added if the drop in RSS is small relative to the increase in p
    • Remark: Adding more variables always increases the training \(R^2\) (risk of overfitting)
  • How accurate is the prediction?
    • Reducible error: inaccuracy between \(f(X)\) and \(\hat Y\), model bias
      • Confidence interval: contains the true \(f(X)\) with specified probability; uncertainty about the average response at a given X
    • Irreducible error: inaccuracy between \(Y\) and \(\hat Y\), random error \(\epsilon\)
      • Prediction interval: contains the true value of Y with specified probability; uncertainty about an individual response at a particular X (always wider than the confidence interval)
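
A sketch of the F-test, \(R^2\)/RSE, and the confidence vs prediction interval distinction, assuming statsmodels is available; the data-generating coefficients are invented for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 3))
y = 1 + 2 * X[:, 0] - X[:, 1] + rng.normal(0, 1, n)   # third predictor is irrelevant

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.fvalue, model.f_pvalue)              # F-test: any relationship at all?
print(model.rsquared, np.sqrt(model.mse_resid))  # R^2 and RSE

# Confidence interval (average response) vs prediction interval (individual response)
new = sm.add_constant(np.array([[0.5, -0.2, 0.1]]), has_constant="add")
pred = model.get_prediction(new)
print(pred.conf_int())            # CI for f(X)
print(pred.conf_int(obs=True))    # PI for Y (wider)
```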

Qualitative response variable

  • Dummy variable: incorporating qualitative variable into regression analysis
  • Coding scheme 0/1 (One-hot encoding):
    • \(\beta_0\): Average Y when X = 0
    • \(\beta_1\): The difference in average Y between X = 1 and X = 0
  • Coding scheme -1/1:
    • \(\beta_0\): Average Y (ignoring X)
    • \(\beta_1\): Amount by which each level's average Y lies above or below the overall average
  • Remark: Coding scheme does not affect the fit
  • More than 2 levels: always use one fewer dummy variable than the number of levels
    • The level with no dummy variable is the baseline
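
A short sketch of dummy coding for a three-level predictor, assuming pandas and statsmodels; the variable names (balance, region) and values are made up:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data with a 3-level qualitative predictor
df = pd.DataFrame({
    "balance": [400, 900, 1200, 700, 300, 1100],
    "region": ["East", "West", "South", "East", "South", "West"],
})

# One fewer dummy than the number of levels ("East" becomes the baseline)
print(pd.get_dummies(df["region"], drop_first=True))

# The formula interface handles the coding automatically
fit = smf.ols("balance ~ region", data=df).fit()
print(fit.params)   # intercept = baseline mean; others = differences from baseline
```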

Extension of linear model

  • Remove additive assumption:
    • Interaction term \(X_j X_k\): the effect on Y of increasing \(X_j\) depends on the value of \(X_k\)
    • Hierarchical principle: If we include an interaction term, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
  • Nonlinear relationship: Polynomial regression
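
A sketch of an interaction term and a polynomial term via the formula interface, assuming statsmodels; the tv/radio/sales names and coefficients are invented:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({"tv": rng.uniform(0, 100, 200), "radio": rng.uniform(0, 50, 200)})
df["sales"] = (2 + 0.05 * df.tv + 0.1 * df.radio
               + 0.002 * df.tv * df.radio + rng.normal(0, 1, 200))

# "tv * radio" expands to both main effects plus the interaction (hierarchical principle)
inter = smf.ols("sales ~ tv * radio", data=df).fit()
# A polynomial term captures a non-linear relationship
poly = smf.ols("sales ~ tv + I(tv ** 2)", data=df).fit()
print(inter.params, poly.params, sep="\n")
```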

Potential Problems

  • Nonlinear between response and predictors
    • Residual plot: look for discernible patterns
    • Solution: transformation on X
  • Correlation among error terms
    • Correlated errors make the estimated standard errors understate the true ones (e.g., accidentally duplicating every observation would shrink them without adding information)
    • The independence assumption is particularly important for linear regression (common issue in time series data)
  • Non-constant variance of error term
    • Residual plot: Funnel shape in residual plot
    • Solution: transformation on Y
  • Outliers: Point where \(y_i\) is unusual given \(x_i\)
    • Remark: An outlier may have little effect on the least squares fit itself, but it can dramatically change the RSE and \(R^2\)
    • Studentized residuals: Detect outliers
  • High-leverage points: Point where \(x_i\) is unusual
    • Remark: Identifying high-leverage points is important; they can strongly influence the fit
    • Remark: High leverage (unusual \(x_i\)) and outlier (unusual \(y_i\)) are different notions; a point that is both is especially dangerous
    • Leverage statistic: to detect high-leverage points
    \[h_i = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{i'=1}^n (x_{i'} - \bar x)^2}\]
    • Between \(\frac{1}{n}\) and 1, and the average leverage always equals \(\frac{p+1}{n}\)
  • Collinearity: Two or more variables are closely related to each other
    • Difficult to separate out the individual effects
    • \(\hat\beta_j\) becomes uncertain: \(SE[\hat\beta_j]\) increases, so the power of the hypothesis test is reduced
    • Correlation matrix: Good for pairs of variables
    • Multicollinearity: Variance inflation factor
    • Solution: Drop one of the collinear variables, or combine them into a single predictor (leverage and VIF diagnostics are sketched below)
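
A sketch of the leverage, studentized-residual, and VIF diagnostics mentioned above, assuming statsmodels; the nearly collinear predictors are simulated:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)      # nearly collinear with x1
X = sm.add_constant(np.column_stack([x1, x2]))
y = 1 + x1 + rng.normal(size=n)

fit = sm.OLS(y, X).fit()
influence = fit.get_influence()
leverage = influence.hat_matrix_diag           # h_i values
student = influence.resid_studentized_external # outlier diagnostic
print(leverage.mean(), X.shape[1] / n)         # average leverage = (p + 1)/n = 3/100

vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)                                    # large VIFs flag multicollinearity
```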

Comparison with KNN

  • Parametric vs non-parametric
  • Linear relationship: LR
  • Non-linear relationship in low dimensions: KNN
  • Non-linear relationship in high dimensions: LR (curse of dimensionality)
  • Few observations per predictor: LR (curse of dimensionality)

Logistic Regression

Why not Linear Regression

  • Coding scheme: a numeric coding of a qualitative response implies an ordering of the outcomes.
  • Linear regression: fitted values are hard to interpret as probabilities (they can fall outside [0, 1]), and the approach does not extend naturally to more than 2 classes.

Logistic Model

  • Logistic function: \(p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\)
  • Odds: \(\frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta_1 X}\)
  • Log odds (logit): \(\log \frac{p(X)}{1-p(X)} = \beta_0 + \beta_1 X\)
    • \(\beta_1\): Increasing X by 1 changes the log odds by \(\beta_1\) (multiplies the odds by \(e^{\beta_1}\))

Estimating coefficients

  • Maximum likelihood: choose \(\hat\beta_0, \hat\beta_1\) to maximize the likelihood function
\[l(\beta_0, \beta_1) = \prod_{i:y_i = 1}p(x_i) \prod_{i':y_{i'}=0} (1 - p(x_{i'}))\]
  • Predictions: Possible to have qualitative predictors (using dummy variables)
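
A sketch of fitting a logistic regression by maximum likelihood and reading off the odds interpretation, assuming statsmodels; the true coefficients are invented:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 500
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(-1 + 2 * x)))       # true beta0 = -1, beta1 = 2
y = rng.binomial(1, p)

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)   # maximum likelihood
print(fit.params)                 # estimated (beta0, beta1)
print(np.exp(fit.params[1]))      # odds multiply by e^{beta1} per unit increase in x
print(fit.predict([[1, 0.5]]))    # estimated P(Y = 1 | x = 0.5)
```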

Multiple Logistic Regression

\[P(Y=1\mid X) = p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}\]

Multinomial Logistic Regression

  • Baseline encoding:
\[P(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1} X_1 + \dots + \beta_{kp} X_p}}{1 + \sum_{l=1}^{K-1}e^{\beta_{l0} + \beta_{l1} X_1 + \dots + \beta_{lp} X_p}}, \quad k = 1,\dots,K-1\]
\[P(Y = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1}e^{\beta_{l0} + \beta_{l1} X_1 + \dots + \beta_{lp} X_p}}\]
  • Remark: The choice of baseline will change the coefficients, but the fitted value will be the same.

  • Softmax encoding: No baseline

\[P(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1} X_1 + \dots + \beta_{kp} X_p}}{\sum_{l=1}^{K}e^{\beta_{l0} + \beta_{l1} X_1 + \dots + \beta_{lp} X_p}}\]

Generative Models for classification

Motivation

  • Conditions where they are useful: substantial separation between the classes (logistic regression becomes unstable), small sample size with approximately normal predictors, or more than two classes.
\[p_k(x) = P(Y = k \mid X= x) = \frac{P(Y = k)P(X = x\mid Y = k)}{\sum_{l=1}^K P(Y = l) P(X =x \mid Y = l)} = \frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}\]
  • Challenge: estimating \(f_k(x) \rightarrow\) simplifying assumption

Linear Discriminant Analysis p = 1

  • Assumption: \(f_k(x) \sim N(\mu_k, \sigma_k^2)\) and \(\sigma_1^2 = ... = \sigma_K^2 = \sigma^2\)
\[\arg\max_k p_k(x) = \arg\max_k \delta_k(x), \quad \delta_k(x) = x \cdot\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log\pi_k\]
  • Remark: Linear in terms of x
  • Challenge: \(\mu_k\) and \(\sigma^2\) are unknown. Then, estimate
\[\hat\mu_k = \frac{1}{n_k}\sum_{i:y_i = k}x_i\] \[\hat\sigma^2 = \frac{1}{n-K}\sum_{k=1}^K\sum_{i:y_i = k}(x_i - \hat\mu_k)^2\] \[\hat\pi_k = \frac{n_k}{n}\]

Linear Discriminant Analysis p > 1

  • Assumption: \(f_k(x)\sim N(\mu_k, \Sigma)\)
\[\arg\max_k p_k(x) = \arg\max_k \delta_k(x), \quad \delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k\]
  • Decision boundary: \(\delta_k(x) = \delta_l(x)\)
  • Remark: Preferred when there are more than 2 classes; also provides low-dimensional views of the data
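
A sketch contrasting LDA with QDA (introduced below) on simulated Gaussian classes sharing a covariance matrix, assuming scikit-learn; the means and covariance are arbitrary choices:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(7)
# Two Gaussian classes with (roughly) the same covariance -> LDA assumption holds
X0 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], 100)
X1 = rng.multivariate_normal([2, 1], [[1, 0.3], [0.3, 1]], 100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)      # linear decision boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)   # class-specific covariances
x_new = np.array([[1.0, 0.5]])
print(lda.predict_proba(x_new), qda.predict_proba(x_new))
```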

Metric

  • Confusion matrix
  • Sensitivity: proportion of actual positives correctly identified (true positive rate)
  • Specificity: proportion of actual negatives correctly identified (true negative rate)
    • A classifier that minimizes the total error rate (as the Bayes classifier does) can score poorly on one of these for the rarer class
    • The threshold (when K = 2) trades sensitivity against specificity \(\rightarrow\) choose it with domain knowledge
  • ROC curve: traces classifier performance over all possible thresholds; overall performance is summarized by the area under the curve (AUC)
    • False positive rate = 1 - Specificity = Type I error rate
    • True positive rate = Sensitivity = Recall = 1 - Type II error rate = Power
    • Precision: among positive predictions, the fraction that are actually positive
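
A sketch computing sensitivity, specificity, precision, and AUC from a confusion matrix, assuming scikit-learn; the labels, scores, and threshold are toy values:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.3, 0.2, 0.6, 0.8, 0.4, 0.9, 0.35, 0.7, 0.05])
y_pred = (scores >= 0.5).astype(int)          # classify at a 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)      # recall / true positive rate / power
specificity = tn / (tn + fp)      # 1 - false positive rate
precision = tp / (tp + fp)
print(sensitivity, specificity, precision)
print(roc_auc_score(y_true, scores))          # threshold-free summary (AUC)
```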

Quadratic Discriminant Analysis

  • Assumption: \(f_k(x)\sim N(\mu_k, \Sigma_k)\)
\[\arg\max_k p_k(x) = \arg\max_k \delta_k(x), \quad \delta_k(x) = -\frac{1}{2}x^T\Sigma_k^{-1}x + x^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\log|\Sigma_k| + \log\pi_k\]
  • Challenge: High variance \(\rightarrow\) work better with more observations (bias-variance trade-off)

Naive Bayes

  • Assumption: Among kth class, p predictors are independent, \(f_k(x) = f_{k1}(x_1) \times... \times f_{kp}(x_p)\)
  • Remark: Works well when n < p (reduces variance)
\[P(Y = k \mid X = x) = \frac{\pi_k \times f_{k1}(x_1) \times \dots\times f_{kp}(x_p)}{\sum_{l=1}^K \pi_l \times f_{l1}(x_1) \times \dots \times f_{lp}(x_p)}\]
  • If \(X_j\) is quantitative, \(f_{kj}(x_j) \sim N(\mu_{kj}, \sigma^2_{kj})\) (Similar to QDA with diagonal covariance matrix) or nonparametric (kernel density estimator)
  • If \(X_j\) is qualitative, count the proportion of training obs for jth predictor corresponding to each class
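
A sketch of Gaussian naive Bayes, assuming a recent scikit-learn (the var_ attribute); the class means are invented, and the features are generated independently within each class to match the NB assumption:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(8)
# Within each class the p = 3 features are generated independently (NB assumption)
X0 = rng.normal([0, 0, 0], 1, (100, 3))
X1 = rng.normal([1, 2, -1], 1, (100, 3))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

nb = GaussianNB().fit(X, y)   # estimates pi_k plus mu_kj, sigma_kj^2 per class/feature
print(nb.theta_)              # class-wise feature means mu_kj
print(nb.var_)                # class-wise feature variances sigma_kj^2
print(nb.predict_proba([[0.5, 1.0, -0.5]]))
```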

Comparison of all methods

  • NB takes the form of a generalized additive model (in the log odds)
  • LDA is a special case of QDA
  • Any classifier with a linear decision boundary is a special case of NB with \(g_{kj}(x_j) = b_{kj}x_j\)
  • NB with quantitative \(X_j\) and \(f_{kj}(x_j) \sim N(\mu_{kj}, \sigma^2_{kj})\) is a special case of QDA (with diagonal \(\Sigma_k\))
  • LDA outperforms logistic regression when the normality assumption approximately holds; logistic regression is better when it does not.
  • KNN is better when the decision boundary is highly non-linear and n is much larger than p. If the boundary is moderately non-linear and n is only modest relative to p, QDA is preferred.

Generalized Linear Model

  • Y belongs to Exponential Family Distribution
  • Poisson Regression: \(\lambda(X_1,...,X_p) = e^{\beta_0 + \beta_1X_1 + ... + \beta_pX_p}\) and use MLE to find \(\hat\beta\)
    • Interpretation: Increasing \(X_j\) by one unit multiplies \(\mathbb E[Y]\) by a factor of \(\exp(\beta_j)\)
    • Mean variance relationship: \(Var[Y] = \mathbb E[Y] = \lambda\)
    • Nonnegative fitted values
  • Link function: Transform mean of the response so that the transformed mean is a linear function of predictors
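
A sketch of Poisson regression with the log link, assuming statsmodels; the true coefficients are invented:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 300
x = rng.uniform(0, 2, n)
lam = np.exp(0.5 + 0.8 * x)       # log link: log E[Y] is linear in x
y = rng.poisson(lam)

fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()
print(fit.params)                 # approximately (0.5, 0.8)
print(np.exp(fit.params[1]))      # one-unit increase in x multiplies E[Y] by e^{beta1}
print(fit.predict([[1, 1.0]]))    # fitted mean at x = 1 (always nonnegative)
```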

Sampling Method

  • Model selection: Select proper level of flexibility
  • Model assessment: Estimate the test error

Cross-validation

Validation Set approach

  • Randomly divide the data into a training set and a validation set
  • Challenge: tends to overestimate the test error rate (only part of the data is used for fitting), and the estimate is highly variable across splits
  • High bias (overestimation), high variance

LOOCV

  • Leave one as validation and the rest as training set. Repeat n times

    \[CV_{(n)} = \frac{1}{n}\sum_{i=1}^n MSE_{i}\]
  • Benefits: approximately unbiased estimate of the test error rate; no randomness from the train/validation split (same result every run)
  • Challenge: expensive
  • Remark: For least squares linear or polynomial regression, a shortcut makes LOOCV as cheap as a single fit: \(CV_{(n)} = \frac{1}{n}\sum_{i=1}^n \big(\frac{y_i - \hat y_i}{1 - h_i}\big)^2\)

k-fold CV

  • Randomly divide the dataset into k groups

    \[CV_{(k)} = \frac{1}{k}\sum_{i=1}^k MSE_i\]
  • LOOCV is a special case
  • Benefits: Less expensive, more accurate estimation of test error rate
  • Trade-off vs LOOCV: more bias (each fit uses fewer training observations) but less variance (the k fits are less correlated)
  • Classification: Similar idea, but can sometimes underestimate the test error rate!
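
A sketch comparing LOOCV and 10-fold CV estimates of test MSE across polynomial degrees, assuming scikit-learn; the quadratic truth and noise level are invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(10)
x = rng.uniform(-2, 2, 100).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 0.5, 100)      # quadratic truth

for degree in (1, 2, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    loo = -cross_val_score(model, x, y, cv=LeaveOneOut(),
                           scoring="neg_mean_squared_error").mean()
    k10 = -cross_val_score(model, x, y, cv=KFold(10, shuffle=True, random_state=0),
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: LOOCV MSE={loo:.3f}  10-fold MSE={k10:.3f}")
```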

Bootstrap

  • Apply when difficult to obtain a measure of variability.
  • Problem: we cannot generate new samples from the original population.
  • Solution: repeatedly sample observations with replacement from the original dataset \(\rightarrow\) compute the estimate on each bootstrap sample and use the spread of these estimates to obtain the standard error of the estimate.
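
A minimal bootstrap sketch for the standard error of a statistic (here the median, which has no simple formula), using numpy only; the sample and the number of resamples B are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(11)
data = rng.lognormal(mean=0, sigma=1, size=200)   # hypothetical skewed sample

def bootstrap_se(sample, statistic, B=1000):
    """Standard error of a statistic via B resamples drawn with replacement."""
    n = len(sample)
    stats = np.array([statistic(rng.choice(sample, size=n, replace=True))
                      for _ in range(B)])
    return stats.std(ddof=1)

print("median:", np.median(data), " bootstrap SE:", bootstrap_se(data, np.median))
```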

Regularization

Motivation

  • Prediction accuracy: plain least squares can have high variance, especially when p is close to or larger than n; constraining or shrinking the coefficients helps.
  • Model interpretability: irrelevant variables add unnecessary complexity, which makes the model harder to interpret.

Subset selection

  • Best Subset selection: \(2^p\) possible models
    • For each k (1,…, p) predictors \(\rightarrow\) Fit all \(\binom{p}{k}\) models containing exactly k predictors \(\rightarrow\) Choose the best of them (smallest RSS, highest \(R^2\), or deviance), called \(M_k\) \(\rightarrow\) Choose the best model among \(M_0,\dots,M_p\) (cross-validation error, \(C_p\), BIC, or adjusted-\(R^2\))
    • Challenge: expensive

Stepwise selection

  • Motivation: Large search space \(\rightarrow\) overfitting and high variance in the estimate
  • Forward:
    • Start from the null model \(M_0\); for each k (0,…, p-1) \(\rightarrow\) Fit the \(p - k\) models that add one more predictor to \(M_k\) \(\rightarrow\) Choose the best (smallest RSS or highest \(R^2\), or deviance), called \(M_{k+1}\) \(\rightarrow\) Choose the best model among \(M_0,\dots,M_p\) (cross-validation error, \(C_p\), BIC, or adjusted-\(R^2\))
    • Does well in practice
    • Not guaranteed to find the best model
    • Possible to apply when n < p
  • Backward:
    • Start from the full model \(M_p\); for each k (p, p-1, …, 1) \(\rightarrow\) Fit the \(k\) models that drop one predictor from \(M_k\) \(\rightarrow\) Choose the best (smallest RSS or highest \(R^2\), or deviance), called \(M_{k-1}\) \(\rightarrow\) Choose the best model among \(M_0,\dots,M_p\) (cross-validation error, \(C_p\), BIC, or adjusted-\(R^2\))
    • Requires n > p
  • Hybrid: Combination of two previous ones
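
A sketch of forward stepwise selection with the final choice made by cross-validation, assuming scikit-learn; the sparse data-generating model is invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(12)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=n)   # only features 0 and 3 matter

selected, remaining, path = [], list(range(p)), []
for _ in range(p):
    # Add the predictor that most reduces RSS given those already selected
    rss = {}
    for j in remaining:
        cols = selected + [j]
        fit = LinearRegression().fit(X[:, cols], y)
        rss[j] = np.sum((y - fit.predict(X[:, cols])) ** 2)
    best = min(rss, key=rss.get)
    selected.append(best)
    remaining.remove(best)
    cv_mse = -cross_val_score(LinearRegression(), X[:, selected], y,
                              cv=5, scoring="neg_mean_squared_error").mean()
    path.append((list(selected), cv_mse))

for subset, mse in path:
    print(subset, round(mse, 3))      # pick the subset with the lowest CV MSE
```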

Choosing the optimal model

  • Indirectly estimate testing error: adjust training error
    • Assumption about true underlying model
    • \(C_p\): \(\frac{1}{n}(RSS + 2d\hat\sigma^2)\)
      • \(\hat\sigma^2\): estimate of the variance of \(\epsilon\) (from the full model)
      • More features (d) increase the penalty term \(2d\hat\sigma^2\)
      • If \(\hat\sigma^2\) is unbiased, \(C_p\) is an unbiased estimate of the test MSE
    • \(AIC\): \(\frac{1}{n}(RSS + 2d\hat\sigma^2)\)
      • Class of models fit by MLE
      • Proportional to \(C_p\)
    • \(BIC\): \(\frac{1}{n}(RSS + \log(n)d\hat\sigma^2)\)
      • Bayesian POV
      • Heavier penalty for large model
    • Adjusted-\(R^2\): \(1-\frac{RSS/(n-d-1)}{TSS/(n-1)}\)
      • Adding more features reduces RSS but also reduces (n-d-1); the relative change matters.
      • Not well motivated in statistical theory
      • A large value is better (opposite of the three criteria above, where smaller is better)
  • Directly estimate testing error: validation set approach or cross-validation
    • Fewer assumption about true underlying model
    • Preferred when it is hard to pin down the degrees of freedom or to estimate \(\sigma^2\)
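
A small sketch of the indirect criteria as functions of RSS, n, and d (the \(C_p\)-style AIC is omitted since it is proportional to \(C_p\)); the numbers below are toy values:

```python
import numpy as np

def selection_criteria(rss, n, d, sigma2_hat, tss):
    """Cp, BIC (same scale), and adjusted R^2 for a model with d predictors."""
    cp = (rss + 2 * d * sigma2_hat) / n
    bic = (rss + np.log(n) * d * sigma2_hat) / n   # heavier penalty once n > 7
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, bic, adj_r2

# Toy numbers: adding a predictor lowers RSS slightly but may not pay its penalty
print(selection_criteria(rss=120.0, n=100, d=3, sigma2_hat=1.2, tss=400.0))
print(selection_criteria(rss=118.5, n=100, d=4, sigma2_hat=1.2, tss=400.0))
```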

Shrinkage

  • Ridge regression:
\[RSS + \lambda\sum_{j=1}^p \beta_j^2\]
  • \(\lambda\): shrinkage penalty, shrink \(\beta_j\) toward zero
  • Remark: shrinkage penalty is not applied to \(\beta_0\) (measure of the mean value of the response)
  • Remark: As \(\lambda\) increases, individual coefficients can occasionally increase, although the overall size \(\sum_{j}\hat\beta_j^2\) decreases
  • Remark: The LS solution is scale equivariant, while the ridge coefficients are not
  • Remark: Standardize the features before applying ridge regression
  • Bias-variance trade-off: Increase bias, reduce variance (flexibility)
  • Remark: Shrink coefficients towards, but not exactly, zero
  • \(\min_\beta \|Y - X\beta\|^2\) subject to \(\sum_{j=1}^p \beta_j^2 \le s\)
  • Remark: Bayesian view: a normal prior on \(\beta\) gives ridge regression as the posterior mode (and mean)

  • LASSO:
\[RSS + \lambda\sum_{j=1}^p |\beta_j|\]
  • Model interpretation/Variable selection: Forcing some coefficients to be exactly zero
  • Sparse model: \(\lambda\) is sufficiently large
  • \(\min_\beta \|Y - X\beta\|^2\) subject to \(\sum_{j=1}^p |\beta_j| \le s\)
  • Closely related to Best Subset selection: \(\min_\beta \|Y - X\beta\|^2\) subject to \(\sum_{j=1}^p I(\beta_j\not = 0) \le s\)
  • Remark: Corner solution
  • Remark: Neither LASSO nor Ridge regression universally dominates the other
  • Remark: Since the absolute value is not differentiable at 0, the lasso solution involves soft thresholding, which sets some coefficients exactly to 0
  • Remark: Bayesian view: a Laplace prior on \(\beta\) gives the lasso as the posterior mode (not the mean)
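
A sketch contrasting ridge and lasso shrinkage, assuming scikit-learn (its alpha corresponds to \(\lambda\), with a slightly different scaling of the lasso objective); features are standardized first, as remarked above:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(13)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=n)   # sparse truth: 2 of 20 matter

Xs = StandardScaler().fit_transform(X)   # penalties are scale-sensitive
for lam in (0.1, 1.0, 10.0):
    ridge = Ridge(alpha=lam).fit(Xs, y)
    lasso = Lasso(alpha=lam).fit(Xs, y)
    print(f"lambda={lam}: ridge nonzero={np.sum(ridge.coef_ != 0)}, "
          f"lasso nonzero={np.sum(lasso.coef_ != 0)}")   # lasso zeroes coefficients
```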

Dimension reduction: To be written in the near future

Consideration in high dimension: To be written in the near future

Beyond Linearity

Polynomial regression

\[y_i = \beta_0 + \beta_1 x_i + ... +\beta_d x_i^d +\epsilon\]
  • Extremely non-linear curve
  • Remark: d < 5, otherwise overly flexible and strange shapes
  • Variance of the fit: compute the pointwise variance of \(\hat f(x_0)\) from the estimated covariance of the coefficients; its square root gives pointwise standard errors
  • Applicable for linear and logistic regression
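
A sketch of polynomial regression as least squares on powers of x, using numpy only; the degree and the sine-shaped truth are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(14)
x = rng.uniform(-3, 3, 150)
y = np.sin(x) + rng.normal(0, 0.3, 150)   # non-linear truth

d = 4                                     # keep d < 5 to avoid strange shapes
Xd = np.vander(x, d + 1, increasing=True) # columns 1, x, x^2, ..., x^d
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

x_grid = np.linspace(-3, 3, 5)
fit = np.vander(x_grid, d + 1, increasing=True) @ beta
print(np.round(fit, 3))                   # fitted curve on a small grid
```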

Step functions:

\[y_i = \beta_0 + \beta_1 C_1(x_i) + ... + \beta_K C_K(x_i) + \epsilon\]
  • Avoids imposing a global structure on the non-linear function
  • Continuous variable \(\rightarrow\) ordered categorical variable
  • \(C_k(x) = I(c_k \le x < c_{k+1})\): indicator function, dummy variable
  • \(\beta_0\): Mean value of Y for \(X < c_1\)
  • \(\beta_j\): Average increase in Y for X in \(c_j \le X < c_{j+1}\) relative to \(X < c_1\)
  • Remark: Unless there are natural breakpoints, step functions can miss the action in the data

Basis functions:

\[y_i = \beta_0 + \beta_1 b_1(x_i) + ... + \beta_K b_K(x_i) + \epsilon\]
  • \(b_k\): fixed and known functions (step functions and polynomial regression are special cases)
  • Applicable for OLS

Regression splines:

  • Piecewise Polynomials: \(y_i = \beta_{01} + \beta_{11}x_i + \beta_{21}x_i^2 + \beta_{31}x_i^3 + \epsilon\) if \(x_i < c\), otherwise \(y_i = \beta_{02} + \beta_{12} x_i + \beta_{22}x_i^2 + \beta_{32}x_i^3 + \epsilon\)
    • Knots: c, more knots = more flexible
    • Remark: Discontinuous (too flexible)
    • Remark: 1 knot and 4 parameters \(\rightarrow\) 8 parameters in total
  • Constraints: the function and its derivatives up to order d - 1 (for a cubic: the function, first, and second derivatives) must be continuous at each knot
    • Remark: Every constraint frees up one degree of freedom
    • Remark: A cubic spline with K knots has K + 4 degrees of freedom (\(\beta_0, \beta_1, \beta_2, \beta_3\) plus one per knot)
  • Spline Basis Representation:
    • Truncated Power basis (cubic spline): \(h(x, \xi) = (x - \xi)_+^3\) with \(\xi\) is the location of the knot
    • Remark: Discontinuity in third derivative
    • Remark: Higher variance at outer range
  • Natural spline: Additional linearity boundary constraints
  • Choosing the number and locations of the knots
    • Remark: Place more knots where the function is expected to vary most rapidly, fewer where it is more stable
    • In practice, knots are often placed in a uniform fashion (e.g., at quantiles of X), with the number chosen by cross-validation
  • Compare with Polynomial Regression
    • Splines add flexibility by increasing the number of knots while keeping the degree fixed, rather than by raising the degree \(\rightarrow\) more stable estimates (a worked basis construction is sketched below)
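
A sketch of a cubic regression spline fit via the truncated power basis described above, using numpy only; the knot locations at the quartiles are an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(15)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, 200)

knots = np.quantile(x, [0.25, 0.5, 0.75])            # K = 3 knots at quartiles

def cubic_spline_basis(x, knots):
    """Truncated power basis: 1, x, x^2, x^3, (x - xi)_+^3 -> K + 4 columns."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - xi, 0, None) ** 3 for xi in knots]
    return np.column_stack(cols)

B = cubic_spline_basis(x, knots)
beta, *_ = np.linalg.lstsq(B, y, rcond=None)         # ordinary least squares fit
print(B.shape)                                       # (200, K + 4) = (200, 7)
print(np.round(cubic_spline_basis(np.array([2.0, 5.0, 8.0]), knots) @ beta, 3))
```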

Smoothing splines:

\[\hat g = \arg\min_g \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2\,dt\]

  • Remark: Without the penalty, g could be made so flexible that it interpolates every \(y_i\) (RSS = 0)
  • Roughness: the second derivative measures how fast the first derivative changes
  • Remark: The integral measures the total change in the first derivative over the range
  • \(\lambda\): as \(\lambda \rightarrow \infty\), g becomes very smooth
  • Remark: The solution is a shrunken version of a natural cubic spline
  • Remark: As \(\lambda\) increases from \(0 \rightarrow \infty\), \(df_{\lambda}\) decreases from \(n \rightarrow 2\)
  • Effective degrees of freedom: measure the flexibility of the smoothing spline

Local regression:

  • Compute the fit at a target point \(x_0\) using only nearby training observations
  • The weights \(K_{i0}\) differ for each value of \(x_0\)
  • Memory-based: similar to KNN (needs the whole training set to compute each fit)
  • s: span, the proportion of points used to compute each local regression, playing a role similar to \(\lambda\)
    • Remark: smaller span = more wiggly fit
  • Poor performance in high dimensions (curse of dimensionality)
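
A sketch of local regression via the LOWESS implementation in statsmodels, where frac plays the role of the span s; the data are simulated:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(16)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.3, 200)

# frac acts like the span s: the proportion of points used at each target x0
smooth_wiggly = lowess(y, x, frac=0.1)   # small span -> more wiggly fit
smooth_stable = lowess(y, x, frac=0.5)   # large span -> smoother fit
print(smooth_stable[:5])                 # columns: sorted x, fitted value at x
```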

Generalized additive models: To be written in the future

Tree-based

  • Basics of Decision Trees
  • Regression Trees
    • Stratification of Feature Space
    • Tree Pruning
  • Classification Trees
  • Trees vs Linear models
  • Advantages and Disadvantages
  • Bagging, RF, Boosting, Bayesian Additive Regression Trees
    • Bagging
      • Out-of-Bag error estimation
      • Variable Importance measures
    • RF
    • Boosting
    • Bayesian Additive Regression Trees: To be added in the future

SVM

  • Maximal Margin Classifier
    • Hyperplane
    • Classification using separating hyperplane
    • The classifier
    • Construction of the classifier
    • Non-separable cases
  • Support Vector Classifier
    • Overview
    • Detail
  • Support Vector Machines
    • Non-linear Decision Boundaries
    • More than 2 classes
      • One vs one
      • One vs all
    • Relationship to logistic regression

Neural Network

Unsupervised Learning