Linear regression (MSE minimization)
it’s a zero-hidden-layer neural network (perceptron) with act(x) = x
Ridge regression = Linear regression + L2 regularization
L2 prevents collinear weights from unjustified mutual explosion: if x1 ≈ x2, then x1 − x2 ≈ 1e5·x1 − 1e5·x2, so without a penalty the paired weights can blow up
Logistic regression → (negative log-likelihood minimization)
perceptron with act(x) = sigmoid(x)
train N trees, then ask each one for an answer and take the average
Condorcet (jury) principle: many independent experts, each right a bit over 50% of the time => majority-vote accuracy grows toward 100% with the number of experts (e.g. 20 independent experts at 60% accuracy give ≈75%; sketch below)
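A quick sanity check of the jury principle (a sketch; the 20-experts-at-60% example is my illustrative correction of the numbers):

```python
from math import comb

def majority_accuracy(n_experts: int, p: float) -> float:
    """Probability that a majority vote of n independent experts,
    each correct with probability p, is correct (ties count as wrong)."""
    return sum(
        comb(n_experts, k) * p**k * (1 - p) ** (n_experts - k)
        for k in range(n_experts // 2 + 1, n_experts + 1)
    )

print(majority_accuracy(20, 0.60))    # ~0.75
print(majority_accuracy(1000, 0.51))  # ~0.73: weak experts need big numbers
```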
Usage:
Check the worst and best case scenarios for a particular model
Show how stable/unstable the model training is across folds
Search for model hyperparameters
N - length of the train data
K < N - size of each bootstrap sample
n - number of bags (bootstrap samples)
We draw n bags, each with K objects sampled with replacement from the N train objects.
We train a network on each bag, giving n networks.
We average the networks' answers (Condorcet principle)
Note: Condorcet works well even with overfitted networks
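A minimal bagging sketch (setup assumed: noisy 1-D regression, scikit-learn trees standing in for the networks):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=500)  # noisy targets

N, K, n_bags = len(X), 300, 25   # N train objects, K-sized bags, n bags
models = []
for _ in range(n_bags):
    idx = rng.integers(0, N, size=K)                  # sample with replacement
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))  # overfits alone

X_test = np.linspace(-3, 3, 100).reshape(-1, 1)
y_avg = np.mean([m.predict(X_test) for m in models], axis=0)  # average answers
```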
Train n algorithms and use their predictions as new features
(with or without initial data)
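A stacking sketch with scikit-learn (toy data assumed; `passthrough` toggles "with initial data"):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),  # learns on the base models' predictions
    passthrough=True,                      # also feed the initial features
)
stack.fit(X, y)
```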
Iteratively train new algorithms, each compensating for the mis-estimation of the previous ones
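A minimal boosting-for-MSE sketch: each new tree fits the residual (mis-estimation) of the ensemble so far (toy data assumed):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=500)

lr, models = 0.1, []
pred = np.zeros_like(y)                  # start from a zero estimate
for _ in range(100):
    residual = y - pred                  # mis-estimation of the ensemble so far
    m = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    models.append(m)
    pred += lr * m.predict(X)            # each new tree compensates the error
```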
GD works poorly when coefficient/feature magnitudes are very high/low. It can be improved by normalizing the features.
Efficiency: GD uses the entire dataset for a single step
Faster convergence: SGD is noisier, which can lead to faster initial convergence
Better generalization: SGD's noise helps avoid local minima / saddle points
Memory: an SGD batch requires only a few samples in RAM (see the sketch below)
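A mini-batch SGD sketch in PyTorch (toy linear-regression setup assumed):

```python
import torch

X = torch.randn(10_000, 5)
true_w = torch.randn(5, 1)
y = X @ true_w + 0.1 * torch.randn(10_000, 1)

model = torch.nn.Linear(5, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(1_000):
    idx = torch.randint(0, len(X), (32,))   # only 32 samples in RAM per step
    loss = torch.nn.functional.mse_loss(model(X[idx]), y[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
```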
Keep z small => prevent sigmoid saturation (i.e. σ'(z) = σ(z)·(1 − σ(z)) → 0 for |z| ≫ 1)
adds model stability (prevents collinear weights from unjustified mutual explosion)
biases (b_coeff) are not regularized
L1: (aka LASSO, least absolute shrinkage and selection operator)
(+) Helps remove irrelevant features (drives their weights exactly to zero)
(−) no analytical solution; the gradient is undefined at zero (instability)
L2: (aka Ridge)
(+) allows small non-zero weights almost for free: the penalty ε² is negligible next to ε (ε² ≪ ε)
(+) has an analytical solution
Note! To make regularization fair, normalize your data first; otherwise features on different scales are penalized unequally = bad.
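A sketch of both penalties in PyTorch (assumptions: plain SGD; in practice L2 is usually the optimizer's weight_decay, while L1 is added to the loss by hand):

```python
import torch

model = torch.nn.Linear(10, 1)

# L2 (Ridge): weight_decay in the optimizer. Note: this penalizes biases too;
# use parameter groups to exclude them, per the note above.
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# L1 (LASSO): add |w| terms to the loss manually (biases excluded here).
def l1_penalty(model, lam=1e-4):
    return lam * sum(p.abs().sum()
                     for name, p in model.named_parameters()
                     if "bias" not in name)
```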
Randomly switch off (zero out) some neurons and learn the data with the remaining ones
Idea: averaging the answers of many networks performs better.
A neuron should not rely too much on the presence/absence of other neurons
The network should not rely too much on the presence/absence of particular features
Hiding some neurons is like training many different networks
The remaining neurons have to learn to work independently
This forces the network to learn more stable data representations
Train loss can be worse than test loss (we perturb the net during learning)
model.predict(): do not apply dropout (in PyTorch, model.eval() disables it)
```python
self.dropout = nn.Dropout(p=0.5)  # each activation is zeroed with probability 0.5
```
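A minimal sketch of how that layer is typically wired and how train/eval modes toggle it (module names are illustrative):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(20, 50)
        self.dropout = nn.Dropout(p=0.5)   # drops 50% of activations in training
        self.fc2 = nn.Linear(50, 1)

    def forward(self, x):
        return self.fc2(self.dropout(torch.relu(self.fc1(x))))

net = Net()
x = torch.randn(4, 20)
net.train()   # dropout active: random neurons zeroed, rest scaled by 1/(1-p)
y_train = net(x)
net.eval()    # dropout disabled at inference
y_eval = net(x)
```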
Normalization is a type of regularization.
It helps prevent overfitting by keeping coefficient magnitudes stable.
Regularizes all feature weights equally (features live on one scale)
Improves stability by reducing the model's sensitivity to its parameters
We can then use a bigger learning rate => faster learning
We don't want to regularize the intercept: it captures the constant offset between model and data
Idea
Normalize by mean and var over every batch: x̂ = (x − μ_batch) / √(σ²_batch + ε)
Learn a new mean and var: y = γ·x̂ + β (γ, β are trainable)
model.predict()
at prediction (forward pass) we may have just 1 object (batch_size = 1), so batch statistics are meaningless
we then use the global mean/var: the moving averages batch norm accumulated during training
```python
self.bn = nn.BatchNorm1d(120)  # normalizes each of the 120 features across the batch
```
Every batch is a set of 3D tensors [ 🧊, 🧊, 🧊… 🧊]
We want an image pattern (cat) to be spatially independent
✔️ We average per channel over batch & height & width
❌ Meaningless: (white_pixel + cat_pixel + white_pixel + white_pixel)/4
```python
self.convbn = nn.BatchNorm2d(16)  # one mean/var per channel, over (batch, H, W)
```
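A sketch checking which dimensions BatchNorm2d averages over (one statistic per channel):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16, 32, 32)        # [batch, channels, height, width]
bn = nn.BatchNorm2d(16)
out = bn(x)                           # train mode: uses the batch statistics

# Per-channel statistics are taken over (batch, H, W), never per pixel.
print(x.mean(dim=(0, 2, 3)).shape)    # torch.Size([16]): one mean per channel
print(out.mean(dim=(0, 2, 3)))        # ~0 for every channel after normalization
```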
Simplify the model (a simpler model is harder to overfit)
Gather more data (new data, fill missing values)
Preprocess the data (fix data errors, handle outliers)
Regularization (huge model parameter magnitudes are not a good idea)
Bias - difference between the expected model estimate and the true value
Variance - dispersion: how far a particular model's output may land from the expected estimate
High bias: wrong model assumptions; the model cannot capture the true relationship
High variance: sensitivity to small fluctuations in the input data; the model learns the noise
Tradeoff: bias can be improved at the cost of variance and vice versa; for MSE, error = Bias² + Variance + irreducible noise
Accuracy
fraction of correctly classified objects
Precision
TP / (TP + FP): of the predicted positives, how many are real
Recall
TP / (TP + FN): of the real positives, how many are found
F-score
harmonic mean of precision and recall
ROC-curve, ROC AUC
PR-curve
Use Adam as optimizer
Initialize weights randomly (❌ constant & zero init: all neurons train in the same direction)
Use 3e-4 as the initial learning rate
For a simple dataset, 3e-2 can be better
To choose a learning rate, start learning with ~10 different rates and check convergence
Always monitor model quality on [train & test] × (accuracy, precision, recall, entropy, etc.); see the sketch below
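A sketch of that default setup (the model and the sweep values are illustrative):

```python
import torch

def make_model():
    # Linear layers are randomly initialized by default (no constant/zero init).
    return torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(),
                               torch.nn.Linear(64, 2))

# 3e-4 as the default starting point; sweep a few rates and watch convergence.
for lr in (3e-2, 3e-3, 3e-4, 3e-5):
    model = make_model()                              # fresh init per trial
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    # ...train, monitoring train & test accuracy / loss every epoch...
```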
we often learn faster when we are badly wrong about something
cross-entropy is always better than squared error, provided the output neurons are sigmoids
learning rates for different loss functions are two different concepts: their scales are not directly comparable
In the 1960s, people computed network gradients numerically as [C(w + ε) − C(w)] / ε.
Perturbing one fixed weight costs one whole forward pass,
so N weights require N ≫ 1 forward passes (see the sketch below).
Backprop: the error in layer L−1 reuses the errors already computed in layer L.
Backprop ~ bootstrapping: each layer's gradients are assembled from the next layer's.
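A sketch contrasting the finite-difference recipe with a single backward pass (PyTorch autograd standing in for backprop; toy quadratic cost):

```python
import torch

w = torch.randn(1000, dtype=torch.float64, requires_grad=True)

def C(v):
    return (v ** 2).sum()   # toy cost function

# 1960s style: one extra forward pass PER weight => 1000 forward passes.
eps = 1e-6
fd_grad = torch.empty(1000, dtype=torch.float64)
with torch.no_grad():
    base = C(w)
    for i in range(1000):
        w_pert = w.clone()
        w_pert[i] += eps
        fd_grad[i] = (C(w_pert) - base) / eps

# Backprop: all 1000 partial derivatives from ONE backward pass.
C(w).backward()
print(torch.allclose(fd_grad, w.grad, atol=1e-4))  # True
```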
Don’t learn the noise. Learn the patterns!
✔️ damps random oscillations
✔️ helps escape local minima
Compute the gradient after the inertia (momentum) step:
✔️ at very high velocity, the post-step gradient is more accurate
✔️ solves the choice of learning rate per parameter (weight)
If the previous gradients were HIGH => the adaptive learning rate is LOW
If the previous gradients were LOW => the adaptive learning rate is HIGH
❌ the cache only ever grows => the adaptive learning rate shrinks to zero and learning eventually stops
(+) fix: exponentially weight past gradients (a decaying average instead of an ever-growing sum):
Adam: momentum (1st moment) + adaptive per-parameter rates (2nd moment); sketch below
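A numpy sketch of the three update rules above (standard formulas; Adam with bias correction):

```python
import numpy as np

def sgd_momentum(w, grad, v, lr=0.01, beta=0.9):
    v = beta * v + grad                 # inertia: accumulate past gradients
    return w - lr * v, v

def rmsprop(w, grad, cache, lr=0.01, beta=0.9, eps=1e-8):
    cache = beta * cache + (1 - beta) * grad**2   # decaying cache, never only grows
    return w - lr * grad / (np.sqrt(cache) + eps), cache

def adam(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # momentum (1st moment)
    v = b2 * v + (1 - b2) * grad**2     # adaptive rate (2nd moment)
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)  # bias correction
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```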
Gradient Vanishing
Typical for deep networks
The first layers do not receive updates
Typical for sigmoid / tanh
✅Treatment:
Use ReLU activation
Better Weight Initialization
Batch Norm or Layer Norm
Use Skip Connection
Gradient Explosion
Typical for deep networks
The first layers receive huge, destabilizing updates
Typical for large weight initialization / recurrent nets
✅Treatment:
Gradient Clipping
Better Weight Initialization
Smaller Learning Rates
Use ReLU Activation
Use Skip Connections
Gradient Clipping
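In PyTorch, clipping is one call between backward() and step() (dummy model assumed):

```python
import torch

model = torch.nn.Linear(10, 1)
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
# Rescale gradients so their global norm is at most 1.0, then the optimizer steps.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```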
Skip Connection
If the network fails to learn the weights of layer 2, the gradient still survives and flows further through the skip path.
(+) creates a sort of network "memory": a combination of low- and high-level patterns
(−) requires a lot of RAM to keep the previous layers' outputs
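A minimal residual block sketch (identity shortcut; dimensions assumed to match):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        out = self.fc2(torch.relu(self.fc1(x)))
        return torch.relu(out + x)  # skip path: gradient flows even if fc1/fc2 stall
```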
Inductive Transfer Learning
target task is different from source task
the model uses the knowledge from the source task
Fine-tuning
pre-trained model is further trained on task-specific data
Feature extraction
a classifier/regressor is trained on the pre-trained model's features (last hidden layer)
Fine-tuning Algo:
Take pretrained big network
Replace last layer(s) architecture (final layer = number of new classes)
Freeze first layers (low-level, normally conv layers)
Train on specific (small) data set
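The algo as a torchvision sketch (assumptions: ResNet-18 as the pretrained big network, 10 new classes):

```python
import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")  # 1. pretrained net
for p in model.parameters():
    p.requires_grad = False                    # 3. freeze low-level (conv) layers
model.fc = torch.nn.Linear(model.fc.in_features, 10)  # 2. new final layer: 10 classes
opt = torch.optim.Adam(model.fc.parameters(), lr=3e-4)
# 4. ...train on the small task-specific dataset...
```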
Cases
Source != Target Domain (English vs Russian language)
Keep the inter-domain (shared) layers, re-learn the intra-domain (specific) ones
Source != Target Distribution (English reviews of films vs of apps)
Source != Target Labels (race vs emotion classification by same photos)
Unsupervised (no target labels)
Rule of thumb
The more data we have & the bigger the source/target difference => the more of the last layers we retrain
Small dataset & similar source/target => retrain only the final / pre-final layers
Why don't we use accuracy as a target (aka loss) function?
Accuracy is not smooth (piecewise constant), so it gives no usable gradient. Can't learn.
Why not just loss function, why do we need metrics?
Judging performance by the loss alone is not always meaningful, since we are explicitly minimizing it.
Metrics let us check the performance independently.
Why should we exclude collinear features?
The analytical solution breaks: XᵀX is singular (non-invertible) when features are collinear
Co-compensation makes learning unstable, especially without regularization
What is the difference between Loss function and Metric?
Loss function is used to train. Metric - to evaluate.
Loss must be differentiable. Metric - anything.
metric can also be loss function if differentiable.