Statistical Learning

Terminology

  • Input: predictors, independent variables, features, variables
  • Output: responses, dependent variables

Prediction

  • Predict \(\hat Y = \hat f(X)\)
  • The accuracy of \(\hat Y\) depends on the reducible error and irreducible error
  • Reducible error: The difference between \(\hat f\) and true \(f\)
  • Irreducible error: \(Y\) also depends on \(\epsilon\), which cannot be predicted from \(X\) (unmeasured or unmeasurable variation)
\[\mathbb E[(Y - \hat Y)^2] = (f(X) - \hat f(X))^2 + \text{Var}[\epsilon]\]

Inference

  • Which predictors are associated with Y?
  • What is the relationship between Y and each predictor?
  • Is a linear equation adequate to represent the relationship between Y and predictors?

How to estimate f

  • Parametric: makes an assumption about the functional form and uses training data to fit the model; estimating a fixed set of parameters is easier, but \(\hat f\) may be far from the true \(f\).
  • Non-parametric: no assumption about the functional form; estimates an \(f\) that gets close to the data points without being too rough or wiggly, but requires a very large number of observations.

Accuracy vs Interpretability (vs Flexibility)

  • More flexible = (maybe) more accurate = less interpretable

Assess model accuracy

  • No Free Lunch theorem: no method dominates all others over all possible data sets.
  • Measuring the Quality of Fit: MSE for regression, error rate for classification.
  • Training vs testing error: as flexibility increases, training error decreases monotonically while testing error is U-shaped; training error is always smaller since we optimize it directly.
  • Overfitting vs underfitting: overfitting shows up as the rising side of the U-shaped testing error (low training error, high testing error); underfitting means both training and testing error are high.

Bias-Variance Trade-off

\[\mathbb E[(y_0 - \hat f(x_0))^2] = \text{Var}[\hat f(x_0)] + \text{Bias}(\hat f(x_0))^2 + \text{Var}[\epsilon]\]
  • Remark: Expected MSE cannot lie below \(Var[\epsilon]\)
  • Remark: We want to achieve simultaneously low bias and low variance
  • Variance: How much \(\hat f\) would change if we estimated it on a different training data set
  • Bias: Error introduced by approximating a real-life problem
    • Remark: More flexible = higher variance = lower bias
    • Remark: The rates at which bias and variance change with flexibility determine the shape of the test MSE
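
A minimal simulation sketch of this decomposition, using numpy only; the true function \(f\), the noise level, and the polynomial degrees below are made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)          # hypothetical true regression function
sigma, n, x0 = 0.3, 50, 0.8          # noise sd, training size, evaluation point

def fit_and_predict(degree):
    """Fit a degree-d polynomial on a fresh training set, predict at x0."""
    x = rng.uniform(0, 2, n)
    y = f(x) + rng.normal(0, sigma, n)
    coefs = np.polyfit(x, y, degree)
    return np.polyval(coefs, x0)

for degree in (1, 3, 9):
    preds = np.array([fit_and_predict(degree) for _ in range(2000)])
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    # Expected test MSE at x0 = bias^2 + variance + sigma^2 (irreducible)
    print(f"degree {degree}: bias^2={bias2:.4f}  var={var:.4f}  "
          f"mse approx {bias2 + var + sigma**2:.4f}")
```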

The Bayes classifier

  • Assigns each observation to most likely class given its predictor values
  • Bayes error rate: \(1 - \max_j \Pr(Y = j \mid X = x_0)\), analogous to the irreducible error
    • Gold standard to compare other methods
  • K-nearest neighbors: \(\Pr(Y = j \mid X = x_0)\) is unknown, so estimate it from the \(K\) nearest training points \(N_0\):

    \[\Pr(Y = j \mid X = x_0) \approx \frac{1}{K}\sum_{i\in N_0} I(y_i = j)\]
    • Remark: Close to optimal Bayes classifier
    • Remark: Selecting K matters (bias-variance trade-off)
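
A small sketch of the KNN estimate above on simulated two-class data, using numpy only; the class means, \(K\), and the query point are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated 2-class training data (class 1 shifted to the upper right)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(1.5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def knn_predict(x0, K=5):
    """Estimate Pr(Y = j | X = x0) by the class fractions among the K nearest points."""
    dist = np.linalg.norm(X - x0, axis=1)
    neighbors = y[np.argsort(dist)[:K]]
    probs = np.bincount(neighbors, minlength=2) / K
    return probs, probs.argmax()      # estimated probabilities, predicted class

print(knn_predict(np.array([0.7, 0.7]), K=5))
```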

Linear Regression

Question

  • Is there a relationship between response and predictors?
  • Is the relationship linear?
  • How strong is the relationship?
  • Is there an interaction (synergy) effect among predictors?
  • How accurate are our predictions?

Formulation

\[Y = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p + \epsilon\]
  • Assumptions: \(\epsilon \perp\!\!\!\!\perp X\), \(\epsilon_i \perp\!\!\!\!\perp \epsilon_j\); \(\epsilon\) is a catch-all for everything we miss (nonlinear relationships, missing predictors, measurement error)
  • Residual: \(e_i = y_i - \hat y_i\)
  • Objective: \(\min RSS = e_1^2 + ... + e_n^2\)
  • Estimating coefficients: Least squares solution
  • Interpretation:
    • \(\hat\beta_0\): Expected value of Y when all predictors are zero
    • \(\hat\beta_j\): Average increase in Y for a one-unit increase in \(X_j\), holding the other predictors fixed

Simple Linear Regression

  • Number of predictors: \(p = 1\)

  • Assessing model:

    • Residual standard error: measure lack of fit, an estimate of standard deviation of \(\epsilon\)

      \[\hat\sigma = RSE = \sqrt{\frac{RSS}{n-p-1}}\]
    • \(R^2\)-statistic: a unit-free, relative measure of fit, the proportion of variability explained by the regression

      \[0 \le R^2 = 1 - \frac{RSS}{TSS} \le 1\]
      • RSS: amount of variability left after performing regression
      • TSS: total variance in the response Y, i.e., the variability before the regression (predicting with \(\bar y\) alone)
      • Interpretational advantage over RSE, but what counts as a good \(R^2\) depends on the application
      • Remark: In simple linear regression, \(R^2 = r^2\) (squared correlation)
    • Others: confidence interval, hypothesis testing, p-value
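
A quick numeric check of RSE, \(R^2\), and the \(R^2 = r^2\) remark on simulated data, using numpy only; the coefficients and noise level are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 10, n)
y = 3 + 2 * x + rng.normal(0, 1.5, n)    # hypothetical data with known truth

# Least squares estimates for p = 1
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
rse = np.sqrt(rss / (n - 2))             # n - p - 1 with p = 1
r2 = 1 - rss / tss
r = np.corrcoef(x, y)[0, 1]
print(f"RSE={rse:.3f}  R^2={r2:.3f}  r^2={r**2:.3f}")   # R^2 equals r^2 here
```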

Multiple Linear Regression

  • Number of predictors: \(p > 1\)
  • Is there a relationship between the response and the predictors?: Hypothesis test (F-test) on all predictors at once
    • \(H_0: \beta_1 = \dots = \beta_p = 0\) vs \(H_a\): at least one \(\beta_j \neq 0\); test statistic \(F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)}\)
    • If the linear model is correct, \(E[\frac{RSS}{n-p-1}] = \sigma^2\); under \(H_0\), \(E[\frac{TSS-RSS}{p}] = \sigma^2\) as well, so \(F \approx 1\) when \(H_0\) holds and \(F \gg 1\) provides evidence against it
    • Remark: Test all predictors together, instead of individually
  • Which predictors are important?
    • Criteria: Mallow’s \(C_p\), AIC, BIC, Adjusted-\(R^2\)
    • Procedure: Forward selection, backward selection, or mixed
  • How well does the model fit?: RSE and \(R^2\)-statistic
    • Remark: RSE \(= \sqrt{RSS/(n-p-1)}\) can increase when a variable is added if the drop in RSS is small relative to the increase in p
    • Remark: Adding more variables always increases the training \(R^2\) (risk of overfitting)
  • How accurate is the prediction?
    • Reducible error: inaccuracy between \(f(X)\) and \(\hat Y\), model bias
      • Confidence interval: contains the true \(f(X)\) with specified probability; uncertainty about the average response at a given X
    • Irreducible error: inaccuracy between \(Y\) and \(\hat Y\), random error \(\epsilon\)
      • Prediction interval: contains the true value of Y with specified probability; uncertainty about an individual response at a particular X (always wider than the confidence interval)
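
A sketch of the F-test, \(R^2\)/RSE, and the confidence vs prediction interval distinction, assuming statsmodels is available; the data-generating coefficients are invented for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 3))
y = 1 + 2 * X[:, 0] - X[:, 1] + rng.normal(0, 1, n)   # third predictor is irrelevant

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.fvalue, model.f_pvalue)              # F-test: any relationship at all?
print(model.rsquared, np.sqrt(model.mse_resid))  # R^2 and RSE

# Confidence interval (average response) vs prediction interval (individual response)
new = sm.add_constant(np.array([[0.5, -0.2, 0.1]]), has_constant="add")
pred = model.get_prediction(new)
print(pred.conf_int())            # CI for f(X)
print(pred.conf_int(obs=True))    # PI for Y (wider)
```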

Qualitative response variable

  • Dummy variable: incorporating qualitative variable into regression analysis
  • Coding scheme 0/1 (One-hot encoding):
    • \(\beta_0\): Average Y when X = 0
    • \(\beta_1\): The difference in average Y between X = 1 and X = 0
  • Coding scheme -1/1:
    • \(\beta_0\): Average Y (ignoring X)
    • \(\beta_1\): Amount by which each level's average Y lies above or below the overall average
  • Remark: Coding scheme does not affect the fit
  • More than 2 levels: always use one fewer dummy variable than the number of levels
    • The level with no dummy variable is the baseline
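
A short sketch of dummy coding for a three-level predictor, assuming pandas and statsmodels; the variable names (balance, region) and values are made up:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data with a 3-level qualitative predictor
df = pd.DataFrame({
    "balance": [400, 900, 1200, 700, 300, 1100],
    "region": ["East", "West", "South", "East", "South", "West"],
})

# One fewer dummy than the number of levels ("East" becomes the baseline)
print(pd.get_dummies(df["region"], drop_first=True))

# The formula interface handles the coding automatically
fit = smf.ols("balance ~ region", data=df).fit()
print(fit.params)   # intercept = baseline mean; others = differences from baseline
```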

Extension of linear model

  • Remove additive assumption:
    • Interaction term \(X_j X_k\): the effect on Y of increasing \(X_j\) depends on the value of \(X_k\)
    • Hierarchical principle: If we include an interaction term, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
  • Nonlinear relationship: Polynomial regression
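
A sketch of an interaction term and a polynomial term via the formula interface, assuming statsmodels; the tv/radio/sales names and coefficients are invented:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({"tv": rng.uniform(0, 100, 200), "radio": rng.uniform(0, 50, 200)})
df["sales"] = (2 + 0.05 * df.tv + 0.1 * df.radio
               + 0.002 * df.tv * df.radio + rng.normal(0, 1, 200))

# "tv * radio" expands to both main effects plus the interaction (hierarchical principle)
inter = smf.ols("sales ~ tv * radio", data=df).fit()
# A polynomial term captures a non-linear relationship
poly = smf.ols("sales ~ tv + I(tv ** 2)", data=df).fit()
print(inter.params, poly.params, sep="\n")
```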

Potential Problems

  • Nonlinear between response and predictors
    • Residual plot: look for discernible patterns
    • Solution: transformation on X
  • Correlation among error terms
    • Correlated errors make the estimated standard errors understate the true ones (e.g., accidentally duplicating every observation would shrink them without adding information)
    • The independence assumption is particularly important for linear regression (common issue in time series data)
  • Non-constant variance of error term
    • Residual plot: Funnel shape in residual plot
    • Solution: transformation on Y
  • Outliers: Point where \(y_i\) is unusual given \(x_i\)
    • Remark: An outlier may have little effect on the least squares fit itself, but it can dramatically change the RSE and \(R^2\)
    • Studentized residuals: Detect outliers
  • High-leverage points: Point where \(x_i\) is unusual
    • Remark: Identifying high-leverage points is important; they can strongly influence the fit
    • Remark: High leverage (unusual \(x_i\)) and outlier (unusual \(y_i\)) are different notions; a point that is both is especially dangerous
    • Leverage statistic: to detect high-leverage points
    \[h_i = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{i'=1}^n (x_{i'} - \bar x)^2}\]
    • Between \(\frac{1}{n}\) and 1, and the average leverage always equals \(\frac{p+1}{n}\)
  • Collinearity: Two or more variables are closely related to each other
    • Difficult to separate out the individual effects
    • \(\hat\beta_j\) becomes uncertain: \(SE[\hat\beta_j]\) increases, so the power of the hypothesis test is reduced
    • Correlation matrix: Good for pairs of variables
    • Multicollinearity: Variance inflation factor
    • Solution: Drop one of the collinear variables, or combine them into a single predictor (leverage and VIF diagnostics are sketched below)
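
A sketch of the leverage, studentized-residual, and VIF diagnostics mentioned above, assuming statsmodels; the nearly collinear predictors are simulated:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)      # nearly collinear with x1
X = sm.add_constant(np.column_stack([x1, x2]))
y = 1 + x1 + rng.normal(size=n)

fit = sm.OLS(y, X).fit()
influence = fit.get_influence()
leverage = influence.hat_matrix_diag           # h_i values
student = influence.resid_studentized_external # outlier diagnostic
print(leverage.mean(), X.shape[1] / n)         # average leverage = (p + 1)/n = 3/100

vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)                                    # large VIFs flag multicollinearity
```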

Comparison with KNN

  • Parametric vs non-parametric
  • Linear relationship: LR
  • Non-linear relationship in low dimensions: KNN
  • Non-linear relationship in high dimensions: LR (curse of dimensionality)
  • Few observations per predictor: LR (curse of dimensionality)

Logistic Regression

Why not Linear Regression

  • Coding scheme: a numeric coding of a qualitative response implies an ordering of the outcomes.
  • Linear regression: fitted values are hard to interpret as probabilities (they can fall outside [0, 1]), and the approach does not extend naturally to more than 2 classes.

Logistic Model

  • Logistic function: \(p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\)
  • Odds: \(\frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta_1 X}\)
  • Log odds (logit): \(\log \frac{p(X)}{1-p(X)} = \beta_0 + \beta_1 X\)
    • \(\beta_1\): Increasing X by 1 changes the log odds by \(\beta_1\) (multiplies the odds by \(e^{\beta_1}\))

Estimating coefficients

  • Maximum likelihood: choose \(\hat\beta_0, \hat\beta_1\) to maximize the likelihood function
\[l(\beta_0, \beta_1) = \prod_{i:y_i = 1}p(x_i) \prod_{i':y_{i'}=0} (1 - p(x_{i'}))\]
  • Predictions: Possible to have qualitative predictors (using dummy variables)
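
A sketch of fitting a logistic regression by maximum likelihood and reading off the odds interpretation, assuming statsmodels; the true coefficients are invented:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 500
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(-1 + 2 * x)))       # true beta0 = -1, beta1 = 2
y = rng.binomial(1, p)

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)   # maximum likelihood
print(fit.params)                 # estimated (beta0, beta1)
print(np.exp(fit.params[1]))      # odds multiply by e^{beta1} per unit increase in x
print(fit.predict([[1, 0.5]]))    # estimated P(Y = 1 | x = 0.5)
```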

Multiple Logistic Regression

\[P(Y=1\mid X) = p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}\]

Multinomial Logistic Regression

  • Baseline encoding:
\[P(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1} X_1 + \dots + \beta_{kp} X_p}}{1 + \sum_{l=1}^{K-1}e^{\beta_{l0} + \beta_{l1} X_1 + \dots + \beta_{lp} X_p}}, \quad k = 1,\dots,K-1\]
\[P(Y = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1}e^{\beta_{l0} + \beta_{l1} X_1 + \dots + \beta_{lp} X_p}}\]
  • Remark: The choice of baseline will change the coefficients, but the fitted value will be the same.

  • Softmax encoding: No baseline

\[P(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1} X_1 + \dots + \beta_{kp} X_p}}{\sum_{l=1}^{K}e^{\beta_{l0} + \beta_{l1} X_1 + \dots + \beta_{lp} X_p}}\]

Generative Models for classification

Motivation

  • Conditions where they are useful: substantial separation between the classes (logistic regression becomes unstable), small sample size with approximately normal predictors, or more than two classes.
\[p_k(x) = P(Y = k \mid X= x) = \frac{P(Y = k)P(X = x\mid Y = k)}{\sum_{l=1}^K P(Y = l) P(X =x \mid Y = l)} = \frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}\]
  • Challenge: estimating \(f_k(x) \rightarrow\) simplifying assumption

Linear Discriminant Analysis p = 1

  • Assumption: \(f_k(x) \sim N(\mu_k, \sigma_k^2)\) and \(\sigma_1^2 = ... = \sigma_K^2 = \sigma^2\)
\[\arg\max_k p_k(x) = \arg\max_k \delta_k(x), \quad \delta_k(x) = x \cdot\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log\pi_k\]
  • Remark: Linear in terms of x
  • Challenge: \(\mu_k\) and \(\sigma^2\) are unknown. Then, estimate
\[\hat\mu_k = \frac{1}{n_k}\sum_{i:y_i = k}x_i\] \[\hat\sigma^2 = \frac{1}{n-K}\sum_{k=1}^K\sum_{i:y_i = k}(x_i - \hat\mu_k)^2\] \[\hat\pi_k = \frac{n_k}{n}\]

Linear Discriminant Analysis p > 1

  • Assumption: \(f_k(x)\sim N(\mu_k, \Sigma)\)
\[\arg\max_k p_k(x) = \arg\max_k \delta_k(x), \quad \delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k\]
  • Decision boundary: \(\delta_k(x) = \delta_l(x)\)
  • Remark: Preferred when there are more than 2 classes; also provides low-dimensional views of the data
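
A sketch contrasting LDA with QDA (introduced below) on simulated Gaussian classes sharing a covariance matrix, assuming scikit-learn; the means and covariance are arbitrary choices:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(7)
# Two Gaussian classes with (roughly) the same covariance -> LDA assumption holds
X0 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], 100)
X1 = rng.multivariate_normal([2, 1], [[1, 0.3], [0.3, 1]], 100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)      # linear decision boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)   # class-specific covariances
x_new = np.array([[1.0, 0.5]])
print(lda.predict_proba(x_new), qda.predict_proba(x_new))
```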

Metric

  • Confusion matrix
  • Sensitivity: proportion of actual positives correctly identified (true positive rate)
  • Specificity: proportion of actual negatives correctly identified (true negative rate)
    • A classifier that minimizes the total error rate (as the Bayes classifier does) can score poorly on one of these for the rarer class
    • The threshold (when K = 2) trades sensitivity against specificity \(\rightarrow\) choose it with domain knowledge
  • ROC curve: traces classifier performance over all possible thresholds; overall performance is summarized by the area under the curve (AUC)
    • False positive rate = 1 - Specificity = Type I error rate
    • True positive rate = Sensitivity = Recall = 1 - Type II error rate = Power
    • Precision: among positive predictions, the fraction that are actually positive
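
A sketch computing sensitivity, specificity, precision, and AUC from a confusion matrix, assuming scikit-learn; the labels, scores, and threshold are toy values:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.3, 0.2, 0.6, 0.8, 0.4, 0.9, 0.35, 0.7, 0.05])
y_pred = (scores >= 0.5).astype(int)          # classify at a 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)      # recall / true positive rate / power
specificity = tn / (tn + fp)      # 1 - false positive rate
precision = tp / (tp + fp)
print(sensitivity, specificity, precision)
print(roc_auc_score(y_true, scores))          # threshold-free summary (AUC)
```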

Quadratic Discriminant Analysis

  • Assumption: \(f_k(x)\sim N(\mu_k, \Sigma_k)\)
\[\arg\max_k p_k(x) = \arg\max_k \delta_k(x), \quad \delta_k(x) = -\frac{1}{2}x^T\Sigma_k^{-1}x + x^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\log|\Sigma_k| + \log\pi_k\]
  • Challenge: High variance \(\rightarrow\) work better with more observations (bias-variance trade-off)

Naive Bayes

  • Assumption: Among kth class, p predictors are independent, \(f_k(x) = f_{k1}(x_1) \times... \times f_{kp}(x_p)\)
  • Remark: Works well when n < p (reduces variance)
\[P(Y = k \mid X = x) = \frac{\pi_k \times f_{k1}(x_1) \times \dots\times f_{kp}(x_p)}{\sum_{l=1}^K \pi_l \times f_{l1}(x_1) \times \dots \times f_{lp}(x_p)}\]
  • If \(X_j\) is quantitative, \(f_{kj}(x_j) \sim N(\mu_{kj}, \sigma^2_{kj})\) (Similar to QDA with diagonal covariance matrix) or nonparametric (kernel density estimator)
  • If \(X_j\) is qualitative, count the proportion of training obs for jth predictor corresponding to each class
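
A sketch of Gaussian naive Bayes, assuming a recent scikit-learn (the var_ attribute); the class means are invented, and the features are generated independently within each class to match the NB assumption:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(8)
# Within each class the p = 3 features are generated independently (NB assumption)
X0 = rng.normal([0, 0, 0], 1, (100, 3))
X1 = rng.normal([1, 2, -1], 1, (100, 3))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

nb = GaussianNB().fit(X, y)   # estimates pi_k plus mu_kj, sigma_kj^2 per class/feature
print(nb.theta_)              # class-wise feature means mu_kj
print(nb.var_)                # class-wise feature variances sigma_kj^2
print(nb.predict_proba([[0.5, 1.0, -0.5]]))
```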

Comparison of all methods

  • NB takes the form of a generalized additive model (in the log odds)
  • LDA is a special case of QDA
  • Any classifier with a linear decision boundary is a special case of NB with \(g_{kj}(x_j) = b_{kj}x_j\)
  • NB with quantitative \(X_j\) and \(f_{kj}(x_j) \sim N(\mu_{kj}, \sigma^2_{kj})\) is a special case of QDA (with diagonal \(\Sigma_k\))
  • LDA outperforms logistic regression when the normality assumption approximately holds; logistic regression is better when it does not.
  • KNN is better when the decision boundary is highly non-linear and n is much larger than p. If the boundary is moderately non-linear and n is only modest relative to p, QDA is preferred.

Generalized Linear Model

  • Y belongs to Exponential Family Distribution
  • Poisson Regression: \(\lambda(X_1,...,X_p) = e^{\beta_0 + \beta_1X_1 + ... + \beta_pX_p}\) and use MLE to find \(\hat\beta\)
    • Interpretation: Increasing \(X_j\) by one unit multiplies \(\mathbb E[Y]\) by a factor of \(\exp(\beta_j)\)
    • Mean variance relationship: \(Var[Y] = \mathbb E[Y] = \lambda\)
    • Nonnegative fitted values
  • Link function: Transform mean of the response so that the transformed mean is a linear function of predictors
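
A sketch of Poisson regression with the log link, assuming statsmodels; the true coefficients are invented:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 300
x = rng.uniform(0, 2, n)
lam = np.exp(0.5 + 0.8 * x)       # log link: log E[Y] is linear in x
y = rng.poisson(lam)

fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()
print(fit.params)                 # approximately (0.5, 0.8)
print(np.exp(fit.params[1]))      # one-unit increase in x multiplies E[Y] by e^{beta1}
print(fit.predict([[1, 1.0]]))    # fitted mean at x = 1 (always nonnegative)
```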

Sampling Method

  • Model selection: Select proper level of flexibility
  • Model assessment: Estimate the test error

Cross-validation

Validation Set approach

  • Randomly divide the data into a training set and a validation set
  • Challenge: tends to overestimate the test error rate (only part of the data is used for fitting), and the estimate is highly variable across splits
  • High bias (overestimation), high variance

LOOCV

  • Leave one as validation and the rest as training set. Repeat n times

    \[CV_{(n)} = \frac{1}{n}\sum_{i=1}^n MSE_{i}\]
  • Benefits: approximately unbiased estimate of the test error rate; no randomness from the train/validation split (same result every run)
  • Challenge: expensive
  • Remark: For least squares linear or polynomial regression, a shortcut makes LOOCV as cheap as a single fit: \(CV_{(n)} = \frac{1}{n}\sum_{i=1}^n \big(\frac{y_i - \hat y_i}{1 - h_i}\big)^2\)

k-fold CV

  • Randomly divide the dataset into k groups

    \[CV_{(k)} = \frac{1}{k}\sum_{i=1}^k MSE_i\]
  • LOOCV is a special case
  • Benefits: Less expensive, more accurate estimation of test error rate
  • Trade-off vs LOOCV: more bias (each fit uses fewer training observations) but less variance (the k fits are less correlated)
  • Classification: Similar idea, but can sometimes underestimate the test error rate!
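
A sketch comparing LOOCV and 10-fold CV estimates of test MSE across polynomial degrees, assuming scikit-learn; the quadratic truth and noise level are invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(10)
x = rng.uniform(-2, 2, 100).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 0.5, 100)      # quadratic truth

for degree in (1, 2, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    loo = -cross_val_score(model, x, y, cv=LeaveOneOut(),
                           scoring="neg_mean_squared_error").mean()
    k10 = -cross_val_score(model, x, y, cv=KFold(10, shuffle=True, random_state=0),
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: LOOCV MSE={loo:.3f}  10-fold MSE={k10:.3f}")
```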

Bootstrap

  • Apply when difficult to obtain a measure of variability.
  • Problem: we cannot generate new samples from the original population.
  • Solution: repeatedly sample observations with replacement from the original dataset \(\rightarrow\) compute the estimate on each bootstrap sample and use the spread of these estimates to obtain the standard error of the estimate.
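
A minimal bootstrap sketch for the standard error of a statistic (here the median, which has no simple formula), using numpy only; the sample and the number of resamples B are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(11)
data = rng.lognormal(mean=0, sigma=1, size=200)   # hypothetical skewed sample

def bootstrap_se(sample, statistic, B=1000):
    """Standard error of a statistic via B resamples drawn with replacement."""
    n = len(sample)
    stats = np.array([statistic(rng.choice(sample, size=n, replace=True))
                      for _ in range(B)])
    return stats.std(ddof=1)

print("median:", np.median(data), " bootstrap SE:", bootstrap_se(data, np.median))
```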

Regularization

Motivation

  • Prediction accuracy: plain least squares can have high variance, especially when p is close to or larger than n; constraining or shrinking the coefficients helps.
  • Model interpretability: irrelevant variables add unnecessary complexity, which makes the model harder to interpret.

Subset selection

  • Best Subset selection: \(2^p\) possible models
    • For each k (1,…, p) predictors \(\rightarrow\) Fit all \(\binom{p}{k}\) models containing exactly k predictors \(\rightarrow\) Choose the best of them (smallest RSS, highest \(R^2\), or deviance), called \(M_k\) \(\rightarrow\) Choose the best model among \(M_0,\dots,M_p\) (cross-validation error, \(C_p\), BIC, or adjusted-\(R^2\))
    • Challenge: expensive

Stepwise selection

  • Motivation: Large search space \(\rightarrow\) overfitting and high variance in the estimate
  • Forward:
    • Start from the null model \(M_0\); for each k (0,…, p-1) \(\rightarrow\) Fit the \(p - k\) models that add one more predictor to \(M_k\) \(\rightarrow\) Choose the best (smallest RSS or highest \(R^2\), or deviance), called \(M_{k+1}\) \(\rightarrow\) Choose the best model among \(M_0,\dots,M_p\) (cross-validation error, \(C_p\), BIC, or adjusted-\(R^2\))
    • Does well in practice
    • Not guaranteed to find the best model
    • Possible to apply when n < p
  • Backward:
    • Start from the full model \(M_p\); for each k (p, p-1, …, 1) \(\rightarrow\) Fit the \(k\) models that drop one predictor from \(M_k\) \(\rightarrow\) Choose the best (smallest RSS or highest \(R^2\), or deviance), called \(M_{k-1}\) \(\rightarrow\) Choose the best model among \(M_0,\dots,M_p\) (cross-validation error, \(C_p\), BIC, or adjusted-\(R^2\))
    • Requires n > p
  • Hybrid: Combination of two previous ones
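
A sketch of forward stepwise selection with the final choice made by cross-validation, assuming scikit-learn; the sparse data-generating model is invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(12)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=n)   # only features 0 and 3 matter

selected, remaining, path = [], list(range(p)), []
for _ in range(p):
    # Add the predictor that most reduces RSS given those already selected
    rss = {}
    for j in remaining:
        cols = selected + [j]
        fit = LinearRegression().fit(X[:, cols], y)
        rss[j] = np.sum((y - fit.predict(X[:, cols])) ** 2)
    best = min(rss, key=rss.get)
    selected.append(best)
    remaining.remove(best)
    cv_mse = -cross_val_score(LinearRegression(), X[:, selected], y,
                              cv=5, scoring="neg_mean_squared_error").mean()
    path.append((list(selected), cv_mse))

for subset, mse in path:
    print(subset, round(mse, 3))      # pick the subset with the lowest CV MSE
```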

Choosing the optimal model

  • Indirectly estimate testing error: adjust training error
    • Assumption about true underlying model
    • \(C_p\): \(\frac{1}{n}(RSS + 2d\hat\sigma^2)\)
      • \(\hat\sigma^2\): estimate of the variance of \(\epsilon\) (from the full model)
      • More features (d) increase the penalty term \(2d\hat\sigma^2\)
      • If \(\hat\sigma^2\) is unbiased, \(C_p\) is an unbiased estimate of the test MSE
    • \(AIC\): \(\frac{1}{n}(RSS + 2d\hat\sigma^2)\)
      • Class of models fit by MLE
      • Proportional to \(C_p\)
    • \(BIC\): \(\frac{1}{n}(RSS + \log(n)d\hat\sigma^2)\)
      • Bayesian POV
      • Heavier penalty for large model
    • Adjusted-\(R^2\): \(1-\frac{RSS/(n-d-1)}{TSS/(n-1)}\)
      • Adding more features reduces RSS but also reduces (n-d-1); the relative change matters.
      • Not well motivated in statistical theory
      • A large value is better (opposite of the three criteria above, where smaller is better)
  • Directly estimate testing error: validation set approach or cross-validation
    • Fewer assumption about true underlying model
    • Preferred when it is hard to pin down the degrees of freedom or to estimate \(\sigma^2\)
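
A small sketch of the indirect criteria as functions of RSS, n, and d (the \(C_p\)-style AIC is omitted since it is proportional to \(C_p\)); the numbers below are toy values:

```python
import numpy as np

def selection_criteria(rss, n, d, sigma2_hat, tss):
    """Cp, BIC (same scale), and adjusted R^2 for a model with d predictors."""
    cp = (rss + 2 * d * sigma2_hat) / n
    bic = (rss + np.log(n) * d * sigma2_hat) / n   # heavier penalty once n > 7
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, bic, adj_r2

# Toy numbers: adding a predictor lowers RSS slightly but may not pay its penalty
print(selection_criteria(rss=120.0, n=100, d=3, sigma2_hat=1.2, tss=400.0))
print(selection_criteria(rss=118.5, n=100, d=4, sigma2_hat=1.2, tss=400.0))
```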

Shrinkage

  • Ridge regression:
\[RSS + \lambda\sum_{j=1}^p \beta_j^2\]
  • \(\lambda\): shrinkage penalty, shrink \(\beta_j\) toward zero
  • Remark: shrinkage penalty is not applied to \(\beta_0\) (measure of the mean value of the response)
  • Remark: As \(\lambda\) increases, individual coefficients can occasionally increase, although the overall size \(\sum_{j}\hat\beta_j^2\) decreases
  • Remark: The LS solution is scale equivariant, while the ridge coefficients are not
  • Remark: Standardize the features before applying ridge regression
  • Bias-variance trade-off: Increase bias, reduce variance (flexibility)
  • Remark: Shrink coefficients towards, but not exactly, zero
  • \(\min_\beta \|Y - X\beta\|^2\) subject to \(\sum_{j=1}^p \beta_j^2 \le s\)
  • Remark: Bayesian view: a normal prior on \(\beta\) gives ridge regression as the posterior mode (and mean)

  • LASSO:
\[RSS + \lambda\sum_{j=1}^p |\beta_j|\]
  • Model interpretation/Variable selection: Forcing some coefficients to be exactly zero
  • Sparse model: \(\lambda\) is sufficiently large
  • \(\min_\beta \|Y - X\beta\|^2\) subject to \(\sum_{j=1}^p |\beta_j| \le s\)
  • Closely related to Best Subset selection: \(\min_\beta \|Y - X\beta\|^2\) subject to \(\sum_{j=1}^p I(\beta_j\not = 0) \le s\)
  • Remark: Corner solution
  • Remark: Neither LASSO nor Ridge regression universally dominates the other
  • Remark: Since the absolute value is not differentiable at 0, the lasso solution involves soft thresholding, which sets some coefficients exactly to 0
  • Remark: Bayesian view: a Laplace prior on \(\beta\) gives the lasso as the posterior mode (not the mean)
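
A sketch contrasting ridge and lasso shrinkage, assuming scikit-learn (its alpha corresponds to \(\lambda\), with a slightly different scaling of the lasso objective); features are standardized first, as remarked above:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(13)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=n)   # sparse truth: 2 of 20 matter

Xs = StandardScaler().fit_transform(X)   # penalties are scale-sensitive
for lam in (0.1, 1.0, 10.0):
    ridge = Ridge(alpha=lam).fit(Xs, y)
    lasso = Lasso(alpha=lam).fit(Xs, y)
    print(f"lambda={lam}: ridge nonzero={np.sum(ridge.coef_ != 0)}, "
          f"lasso nonzero={np.sum(lasso.coef_ != 0)}")   # lasso zeroes coefficients
```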

Dimension reduction: To be written in the near future

Consideration in high dimension: To be written in the near future

Beyond Linearity

Polynomial regression

\[y_i = \beta_0 + \beta_1 x_i + ... +\beta_d x_i^d +\epsilon\]
  • Extremely non-linear curve
  • Remark: d < 5, otherwise overly flexible and strange shapes
  • Variance of the fit: compute the pointwise variance of \(\hat f(x_0)\) from the estimated covariance of the coefficients; its square root gives pointwise standard errors
  • Applicable for linear and logistic regression
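
A sketch of polynomial regression as least squares on powers of x, using numpy only; the degree and the sine-shaped truth are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(14)
x = rng.uniform(-3, 3, 150)
y = np.sin(x) + rng.normal(0, 0.3, 150)   # non-linear truth

d = 4                                     # keep d < 5 to avoid strange shapes
Xd = np.vander(x, d + 1, increasing=True) # columns 1, x, x^2, ..., x^d
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

x_grid = np.linspace(-3, 3, 5)
fit = np.vander(x_grid, d + 1, increasing=True) @ beta
print(np.round(fit, 3))                   # fitted curve on a small grid
```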

Step functions:

\[y_i = \beta_0 + \beta_1 C_1(x_i) + ... + \beta_K C_K(x_i) + \epsilon\]
  • Avoids imposing a global structure on the non-linear function
  • Continuous variable \(\rightarrow\) ordered categorical variable
  • \(C_k(x) = I(c_k \le x < c_{k+1})\): indicator function, dummy variable
  • \(\beta_0\): Mean value of Y for \(X < c_1\)
  • \(\beta_j\): Average increase in Y for X in \(c_j \le X < c_{j+1}\) relative to \(X < c_1\)
  • Remark: Unless there are natural breakpoints, step functions can miss the action in the data

Basis functions:

\[y_i = \beta_0 + \beta_1 b_1(x_i) + ... + \beta_K b_K(x_i) + \epsilon\]
  • \(b_k\): fixed and known functions (step functions and polynomial regression are special cases)
  • Applicable for OLS

Regression splines:

  • Piecewise Polynomials: \(y_i = \beta_{01} + \beta_{11}x_i + \beta_{21}x_i^2 + \beta_{31}x_i^3 + \epsilon\) if \(x_i < c\), otherwise \(y_i = \beta_{02} + \beta_{12} x_i + \beta_{22}x_i^2 + \beta_{32}x_i^3 + \epsilon\)
    • Knots: c, more knots = more flexible
    • Remark: Discontinuous (too flexible)
    • Remark: 1 knot and 4 parameters \(\rightarrow\) 8 parameters in total
  • Constraints: the function and its derivatives up to order d - 1 (for a cubic: the function, first, and second derivatives) must be continuous at each knot
    • Remark: Every constraint frees up one degree of freedom
    • Remark: A cubic spline with K knots has K + 4 degrees of freedom (\(\beta_0, \beta_1, \beta_2, \beta_3\) plus one per knot)
  • Spline Basis Representation:
    • Truncated Power basis (cubic spline): \(h(x, \xi) = (x - \xi)_+^3\) with \(\xi\) is the location of the knot
    • Remark: Discontinuity in third derivative
    • Remark: Higher variance at outer range
  • Natural spline: Additional linearity boundary constraints
  • Choosing the number and locations of the knots
    • Remark: Place more knots where the function is expected to vary most rapidly, fewer where it is more stable
    • In practice, knots are often placed in a uniform fashion (e.g., at quantiles of X), with the number chosen by cross-validation
  • Compare with Polynomial Regression
    • Splines add flexibility by increasing the number of knots while keeping the degree fixed, rather than by raising the degree \(\rightarrow\) more stable estimates (a worked basis construction is sketched below)
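
A sketch of a cubic regression spline fit via the truncated power basis described above, using numpy only; the knot locations at the quartiles are an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(15)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, 200)

knots = np.quantile(x, [0.25, 0.5, 0.75])            # K = 3 knots at quartiles

def cubic_spline_basis(x, knots):
    """Truncated power basis: 1, x, x^2, x^3, (x - xi)_+^3 -> K + 4 columns."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - xi, 0, None) ** 3 for xi in knots]
    return np.column_stack(cols)

B = cubic_spline_basis(x, knots)
beta, *_ = np.linalg.lstsq(B, y, rcond=None)         # ordinary least squares fit
print(B.shape)                                       # (200, K + 4) = (200, 7)
print(np.round(cubic_spline_basis(np.array([2.0, 5.0, 8.0]), knots) @ beta, 3))
```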

Smoothing splines:

\[\hat g = \arg\min_g \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2\,dt\]

  • Remark: Without the penalty, g could be made so flexible that it interpolates every \(y_i\) (RSS = 0)
  • Roughness: the second derivative measures how fast the first derivative changes
  • Remark: The integral measures the total change in the first derivative over the range
  • \(\lambda\): as \(\lambda \rightarrow \infty\), g becomes very smooth
  • Remark: The solution is a shrunken version of a natural cubic spline
  • Remark: As \(\lambda\) increases from \(0 \rightarrow \infty\), \(df_{\lambda}\) decreases from \(n \rightarrow 2\)
  • Effective degrees of freedom: measure the flexibility of the smoothing spline

Local regression:

  • Compute the fit at a target point \(x_0\) using only nearby training observations
  • The weights \(K_{i0}\) differ for each value of \(x_0\)
  • Memory-based: similar to KNN (needs the whole training set to compute each fit)
  • s: span, the proportion of points used to compute each local regression, playing a role similar to \(\lambda\)
    • Remark: smaller span = more wiggly fit
  • Poor performance in high dimensions (curse of dimensionality)
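
A sketch of local regression via the LOWESS implementation in statsmodels, where frac plays the role of the span s; the data are simulated:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(16)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.3, 200)

# frac acts like the span s: the proportion of points used at each target x0
smooth_wiggly = lowess(y, x, frac=0.1)   # small span -> more wiggly fit
smooth_stable = lowess(y, x, frac=0.5)   # large span -> smoother fit
print(smooth_stable[:5])                 # columns: sorted x, fitted value at x
```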

Generalized additive models: To be written in the future

Tree-based

  • Basics of Decision Trees
  • Regression Trees
    • Stratification of Feature Space
    • Tree Pruning
  • Classification Trees
  • Trees vs Linear models
  • Advantages and Disadvantages
  • Bagging, RF, Boosting, Bayesian Additive Regression Trees
    • Bagging
      • Out-of-Bag error estimation
      • Variable Importance measures
    • RF
    • Boosting
    • Bayesian Additive Regression Trees: To be added in the future

SVM

  • Maximal Margin Classifier
    • Hyperplane
    • Classification using separating hyperplane
    • The classifier
    • Construction of the classifier
    • Non-separable cases
  • Support Vector Classifier
    • Overview
    • Detail
  • Support Vector Machines
    • Non-linear Decision Boundaries
    • More than 2 classes
      • One vs one
      • One vs all
    • Relationship to logistic regression

Neural Network

Unsupervised Learning