Linear regression (MSE minimization)
it’s a zero-hidden-layer neural network (perceptron) with act(x) = x
Ridge regression = Linear regression + L2 regularization
L2 prevents collinear weights from unjustified mutual explosion: if x1 ≈ x2, then x1 − x2 ≈ 1e5·x1 − 1e5·x2, so without a penalty the paired weights can blow up
Logistic regression → (negative log-likelihood minimization)
perceptron with act(x) = sigmoid(x)
train N trees, then ask each one for an answer and take the average
Condorcet (jury) principle: many independent experts, each right a bit over 50% of the time => majority-vote accuracy grows toward 100% with the number of experts (e.g. 20 independent experts at 60% accuracy give ≈75%; sketch below)
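A quick sanity check of the jury principle (a sketch; the 20-experts-at-60% example is my illustrative correction of the numbers):

```python
from math import comb

def majority_accuracy(n_experts: int, p: float) -> float:
    """Probability that a majority vote of n independent experts,
    each correct with probability p, is correct (ties count as wrong)."""
    return sum(
        comb(n_experts, k) * p**k * (1 - p) ** (n_experts - k)
        for k in range(n_experts // 2 + 1, n_experts + 1)
    )

print(majority_accuracy(20, 0.60))    # ~0.75
print(majority_accuracy(1000, 0.51))  # ~0.73: weak experts need big numbers
```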
Usage:
Check the worst and best case scenarios for a particular model
Show how stable/unstable the model training is across folds
Search for model hyperparameters
N - length of the train data
K < N - size of each bootstrap sample
n - number of bags (bootstrap samples)
We draw n bags, each with K objects sampled with replacement from the N train objects.
We train a network on each bag, giving n networks.
We average the networks' answers (Condorcet principle)
Note: Condorcet works well even with overfitted networks
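A minimal bagging sketch (setup assumed: noisy 1-D regression, scikit-learn trees standing in for the networks):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=500)  # noisy targets

N, K, n_bags = len(X), 300, 25   # N train objects, K-sized bags, n bags
models = []
for _ in range(n_bags):
    idx = rng.integers(0, N, size=K)                  # sample with replacement
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))  # overfits alone

X_test = np.linspace(-3, 3, 100).reshape(-1, 1)
y_avg = np.mean([m.predict(X_test) for m in models], axis=0)  # average answers
```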
Train n algorithms and use their predictions as new features
(with or without initial data)
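A stacking sketch with scikit-learn (toy data assumed; `passthrough` toggles "with initial data"):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),  # learns on the base models' predictions
    passthrough=True,                      # also feed the initial features
)
stack.fit(X, y)
```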
Iteratively train new algorithms, each compensating for the mis-estimation of the previous ones
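A minimal boosting-for-MSE sketch: each new tree fits the residual (mis-estimation) of the ensemble so far (toy data assumed):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=500)

lr, models = 0.1, []
pred = np.zeros_like(y)                  # start from a zero estimate
for _ in range(100):
    residual = y - pred                  # mis-estimation of the ensemble so far
    m = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    models.append(m)
    pred += lr * m.predict(X)            # each new tree compensates the error
```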
GD works poorly when coefficient/feature magnitudes are very high/low. It can be improved by normalizing the features.
Efficiency: GD uses the entire dataset for a single step
Faster convergence: SGD is noisier, which can lead to faster initial convergence
Better generalization: SGD's noise helps avoid local minima / saddle points
Memory: an SGD batch requires only a few samples in RAM (see the sketch below)
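A mini-batch SGD sketch in PyTorch (toy linear-regression setup assumed):

```python
import torch

X = torch.randn(10_000, 5)
true_w = torch.randn(5, 1)
y = X @ true_w + 0.1 * torch.randn(10_000, 1)

model = torch.nn.Linear(5, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(1_000):
    idx = torch.randint(0, len(X), (32,))   # only 32 samples in RAM per step
    loss = torch.nn.functional.mse_loss(model(X[idx]), y[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
```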
Keep z small => prevent sigmoid saturation (i.e. σ'(z) = σ(z)·(1 − σ(z)) → 0 for |z| ≫ 1)
adds model stability (prevents collinear weights from unjustified mutual explosion)
biases (b_coeff) are not regularized
L1: (aka LASSO, least absolute shrinkage and selection operator)
(+) Helps remove irrelevant features (drives their weights exactly to zero)
(−) no analytical solution; the gradient is undefined at zero (instability)
L2: (aka Ridge)
(+) allows small non-zero weights almost for free: the penalty ε² is negligible next to ε (ε² ≪ ε)
(+) has an analytical solution
Note! To make regularization fair, normalize your data first; otherwise features on different scales are penalized unequally = bad.
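A sketch of both penalties in PyTorch (assumptions: plain SGD; in practice L2 is usually the optimizer's weight_decay, while L1 is added to the loss by hand):

```python
import torch

model = torch.nn.Linear(10, 1)

# L2 (Ridge): weight_decay in the optimizer. Note: this penalizes biases too;
# use parameter groups to exclude them, per the note above.
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# L1 (LASSO): add |w| terms to the loss manually (biases excluded here).
def l1_penalty(model, lam=1e-4):
    return lam * sum(p.abs().sum()
                     for name, p in model.named_parameters()
                     if "bias" not in name)
```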
Randomly switch off (zero out) some neurons and learn the data with the remaining ones
Idea: averaging the answers of many networks performs better.
A neuron should not rely too much on the presence/absence of other neurons
The network should not rely too much on the presence/absence of particular features
Hiding some neurons is like training many different networks
The remaining neurons have to learn to work independently
This forces the network to learn more stable data representations
Train loss can be worse than test loss (we perturb the net during learning)
model.predict(): do not apply dropout (in PyTorch, model.eval() disables it)
```python
self.dropout = nn.Dropout(p=0.5)  # each activation is zeroed with probability 0.5
```
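A minimal sketch of how that layer is typically wired and how train/eval modes toggle it (module names are illustrative):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(20, 50)
        self.dropout = nn.Dropout(p=0.5)   # drops 50% of activations in training
        self.fc2 = nn.Linear(50, 1)

    def forward(self, x):
        return self.fc2(self.dropout(torch.relu(self.fc1(x))))

net = Net()
x = torch.randn(4, 20)
net.train()   # dropout active: random neurons zeroed, rest scaled by 1/(1-p)
y_train = net(x)
net.eval()    # dropout disabled at inference
y_eval = net(x)
```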
Normalization is a type of regularization.
It helps prevent overfitting by keeping coefficient magnitudes stable.
Regularizes all feature weights equally (features live on one scale)
Improves stability by reducing the model's sensitivity to its parameters
We can then use a bigger learning rate => faster learning
We don't want to regularize the intercept: it captures the constant offset between model and data
Idea
Normalize by mean and var over every batch: x̂ = (x − μ_batch) / √(σ²_batch + ε)
Learn a new mean and var: y = γ·x̂ + β (γ, β are trainable)
model.predict()
at prediction (forward pass) we may have just 1 object (batch_size = 1), so batch statistics are meaningless
we then use the global mean/var: the moving averages batch norm accumulated during training
```python
self.bn = nn.BatchNorm1d(120)  # normalizes each of the 120 features across the batch
```
Every batch is a set of 3D tensors [ 🧊, 🧊, 🧊… 🧊]
We want an image pattern (cat) to be spatially independent
✔️ We average per channel over batch & height & width
❌ Meaningless: (white_pixel + cat_pixel + white_pixel + white_pixel)/4
```python
self.convbn = nn.BatchNorm2d(16)  # one mean/var per channel, over (batch, H, W)
```
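A sketch checking which dimensions BatchNorm2d averages over (one statistic per channel):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16, 32, 32)        # [batch, channels, height, width]
bn = nn.BatchNorm2d(16)
out = bn(x)                           # train mode: uses the batch statistics

# Per-channel statistics are taken over (batch, H, W), never per pixel.
print(x.mean(dim=(0, 2, 3)).shape)    # torch.Size([16]): one mean per channel
print(out.mean(dim=(0, 2, 3)))        # ~0 for every channel after normalization
```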
Simplify the model (a simpler model is harder to overfit)
Gather more data (new data, fill missing values)
Preprocess the data (fix data errors, handle outliers)
Regularization (huge model parameter magnitudes are not a good idea)
Bias - difference between the expected model estimate and the true value
Variance - dispersion: how far a particular model's output may land from the expected estimate
High bias: wrong model assumptions; the model cannot capture the true relationship
High variance: sensitivity to small fluctuations in the input data; the model learns the noise
Tradeoff: bias can be improved at the cost of variance and vice versa; for MSE, error = Bias² + Variance + irreducible noise
Accuracy
fraction of correctly classified objects
Precision
TP / (TP + FP): of the predicted positives, how many are real
Recall
TP / (TP + FN): of the real positives, how many are found
F-score
harmonic mean of precision and recall
ROC-curve, ROC AUC
PR-curve
Use Adam as optimizer
Initialize weights randomly (❌ constant & zero init: all neurons train in the same direction)
Use 3e-4 as the initial learning rate
For a simple dataset, 3e-2 can be better
To choose a learning rate, start learning with ~10 different rates and check convergence
Always monitor model quality on [train & test] × (accuracy, precision, recall, entropy, etc.); see the sketch below
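A sketch of that default setup (the model and the sweep values are illustrative):

```python
import torch

def make_model():
    # Linear layers are randomly initialized by default (no constant/zero init).
    return torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(),
                               torch.nn.Linear(64, 2))

# 3e-4 as the default starting point; sweep a few rates and watch convergence.
for lr in (3e-2, 3e-3, 3e-4, 3e-5):
    model = make_model()                              # fresh init per trial
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    # ...train, monitoring train & test accuracy / loss every epoch...
```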
we often learn faster when we are badly wrong about something
cross-entropy is always better than squared error, provided the output neurons are sigmoids
learning rates for different loss functions are two different concepts: their scales are not directly comparable
In the 1960s, people computed network gradients numerically as [C(w + ε) − C(w)] / ε.
Perturbing one fixed weight costs one whole forward pass,
so N weights require N ≫ 1 forward passes (see the sketch below).
Backprop: the error in layer L−1 reuses the errors already computed in layer L.
Backprop ~ bootstrapping: each layer's gradients are assembled from the next layer's.
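A sketch contrasting the finite-difference recipe with a single backward pass (PyTorch autograd standing in for backprop; toy quadratic cost):

```python
import torch

w = torch.randn(1000, dtype=torch.float64, requires_grad=True)

def C(v):
    return (v ** 2).sum()   # toy cost function

# 1960s style: one extra forward pass PER weight => 1000 forward passes.
eps = 1e-6
fd_grad = torch.empty(1000, dtype=torch.float64)
with torch.no_grad():
    base = C(w)
    for i in range(1000):
        w_pert = w.clone()
        w_pert[i] += eps
        fd_grad[i] = (C(w_pert) - base) / eps

# Backprop: all 1000 partial derivatives from ONE backward pass.
C(w).backward()
print(torch.allclose(fd_grad, w.grad, atol=1e-4))  # True
```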
Don’t learn the noise. Learn the patterns!
✔️ damps random oscillations
✔️ helps escape local minima
Compute the gradient after the inertia (momentum) step:
✔️ at very high velocity, the post-step gradient is more accurate
✔️ solves the choice of learning rate per parameter (weight)
If the previous gradients were HIGH => the adaptive learning rate is LOW
If the previous gradients were LOW => the adaptive learning rate is HIGH
❌ the cache only ever grows => the adaptive learning rate shrinks to zero and learning eventually stops
(+) fix: exponentially weight past gradients (a decaying average instead of an ever-growing sum):
Adam: momentum (1st moment) + adaptive per-parameter rates (2nd moment); sketch below
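A numpy sketch of the three update rules above (standard formulas; Adam with bias correction):

```python
import numpy as np

def sgd_momentum(w, grad, v, lr=0.01, beta=0.9):
    v = beta * v + grad                 # inertia: accumulate past gradients
    return w - lr * v, v

def rmsprop(w, grad, cache, lr=0.01, beta=0.9, eps=1e-8):
    cache = beta * cache + (1 - beta) * grad**2   # decaying cache, never only grows
    return w - lr * grad / (np.sqrt(cache) + eps), cache

def adam(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # momentum (1st moment)
    v = b2 * v + (1 - b2) * grad**2     # adaptive rate (2nd moment)
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)  # bias correction
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```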
Gradient Vanishing
Typical for deep networks
The first layers do not receive updates
Typical for sigmoid / tanh
✅Treatment:
Use ReLU activation
Better Weight Initialization
Batch Norm or Layer Norm
Use Skip Connection
Gradient Explosion
Typical for deep networks
The first layers receive huge, destabilizing updates
Typical for large weight initialization / recurrent nets
✅Treatment:
Gradient Clipping
Better Weight Initialization
Smaller Learning Rates
Use ReLU Activation
Use Skip Connections
Gradient Clipping
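In PyTorch, clipping is one call between backward() and step() (dummy model assumed):

```python
import torch

model = torch.nn.Linear(10, 1)
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
# Rescale gradients so their global norm is at most 1.0, then the optimizer steps.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```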
Skip Connection
If the network fails to learn the weights of layer 2, the gradient still survives and flows further through the skip path.
(+) creates a sort of network "memory": a combination of low- and high-level patterns
(−) requires a lot of RAM to keep the previous layers' outputs
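A minimal residual block sketch (identity shortcut; dimensions assumed to match):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        out = self.fc2(torch.relu(self.fc1(x)))
        return torch.relu(out + x)  # skip path: gradient flows even if fc1/fc2 stall
```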
Inductive Transfer Learning
target task is different from source task
the model uses the knowledge from the source task
Fine-tuning
pre-trained model is further trained on task-specific data
Feature extraction
a classifier/regressor is trained on the pre-trained model's features (last hidden layer)
Fine-tuning Algo:
Take pretrained big network
Replace last layer(s) architecture (final layer = number of new classes)
Freeze first layers (low-level, normally conv layers)
Train on specific (small) data set
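The algo as a torchvision sketch (assumptions: ResNet-18 as the pretrained big network, 10 new classes):

```python
import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")  # 1. pretrained net
for p in model.parameters():
    p.requires_grad = False                    # 3. freeze low-level (conv) layers
model.fc = torch.nn.Linear(model.fc.in_features, 10)  # 2. new final layer: 10 classes
opt = torch.optim.Adam(model.fc.parameters(), lr=3e-4)
# 4. ...train on the small task-specific dataset...
```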
Cases
Source != Target Domain (English vs Russian language)
Keep the inter-domain (shared) layers, re-learn the intra-domain (specific) ones
Source != Target Distribution (English reviews of films vs of apps)
Source != Target Labels (race vs emotion classification by same photos)
Unsupervised (no target labels)
Rule of thumb
The more data we have & the bigger the source/target difference => the more of the last layers we retrain
Small dataset & similar source/target => retrain only the final / pre-final layers
Why don't we use accuracy as a target (aka loss) function?
Accuracy is not smooth (piecewise constant), so it gives no usable gradient. Can't learn.
Why not just loss function, why do we need metrics?
Judging performance by the loss alone is not always meaningful, since we are explicitly minimizing it.
Metrics let us check the performance independently.
Why should we exclude collinear features?
The analytical solution breaks: XᵀX is singular (non-invertible) when features are collinear
Co-compensation makes learning unstable, especially without regularization
What is the difference between Loss function and Metric?
Loss function is used to train. Metric - to evaluate.
Loss must be differentiable. Metric - anything.
metric can also be loss function if differentiable.