
Keras

General

  • Bias is the constant term in y = mx + c, where m is the weight (gradient) and c is the constant (bias). The bias shifts the curve away from the origin, which helps the model fit data in any region of the input space.

Known issues

  • The weights are randomized before we start training a NN, so the output and the training error will be different each time we run the NN. Some ways to mitigate this are bootstrap aggregating, dropout, etc.

Batch Normalization

This video by Andrew Ng explains it best. Normalizing input features to zero mean and unit variance speeds up learning. Batch normalization does a similar thing for the hidden layers. Intuition:
When a network is trained on images of black cats, it may not work well on colored cats. This is because the data, and hence the decision boundary, has shifted with respect to the origin. Similarly, the input to each hidden layer shifts around as the previous layers change (like inputs that are not normalized), so it helps to add a batch normalization layer.

It also has a slight regularization effect. Each mini-batch is scaled by the mean/variance of only that mini-batch, which adds noise to each hidden layer's activations. A larger mini-batch reduces this noise, and hence the regularization effect.
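A minimal sketch of where a batch normalization layer could sit in a Keras model (the layer sizes, activations, and input shape are placeholders):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation

model = Sequential([
    Dense(64, input_shape=(20,)),   # hypothetical input with 20 features
    BatchNormalization(),           # normalize this layer's pre-activations per mini-batch
    Activation("relu"),
    Dense(1, activation="sigmoid"),
])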

Regularization techniques

  • Weight sharing - as done in CNNs, applying the same filter weights across the image.

  • Data augmentation - augmenting existing data and generating synthetic data with generative models.

  • Large amount of training data- thanks to ImageNet etc.

  • Pre-training - for example, use weights learnt on ImageNet before training a classifier on, say, the Caltech dataset.

  • The use of ReLUs in neural nets by itself encourages sparsity, since they allow zero activations. More ReLUs stay active for complex regions of feature space and deactivate for simple regions, so model complexity effectively varies with problem complexity.

Saving Models

Specifically, in Keras models can be saved in 3 formats, viz. YAML, JSON and HDF5. HDF5 saves the architecture together with the weights; the other two save only the architecture as a string.

model.save("path/to/file.h5")   # HDF5: architecture + weights
# model.to_json() / model.to_yaml() return architecture-only strings

For serving, the SavedModel format is used; the model is saved as a protobuf.

from tensorflow import saved_model

saved_model.save(obj, "path/to/save", signatures, options)
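
The saved protobuf can be loaded back later; a minimal sketch, reusing the same hypothetical path:

loaded = saved_model.load("path/to/save")  # returns the restored object with its signatures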

Callbacks

Used while calling model.fit(). Callbacks serve multiple functions, such as:

Early Stopping

  • Rather than deciding in advance how many epochs to train the NN, we keep a separate validation set and train until the validation loss stops decreasing.
  • Generally the training set has a better loss than the validation set, because the NN has seen those values and has learnt them.
from tensorflow.keras.callbacks import EarlyStopping
monitor = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(X, y, validation_data=(X_test, y_test), callbacks=[monitor], epochs=1000)

Backpropagation and Gradient Descent

  • Classic backpropagation
  • Momentum backpropagation - adds a momentum term that helps push the optimizer past local minima, giving it a better chance of finding the global minimum.
  • Batch and online backpropagation - in online training the gradient is calculated and applied for one row at a time. In batching, the gradient vectors of the rows are accumulated until the batch size is reached, e.g. after 10 rows are summed the weight change is applied. This makes training more efficient (see the sketch below).
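
A minimal toy sketch of the difference, assuming a made-up one-weight linear model with squared-error loss:

import numpy as np

def minibatch_sgd(x, y, w=0.0, lr=0.01, batch_size=10):
    # accumulate per-row gradients and apply one averaged update per batch
    grad_sum, count = 0.0, 0
    for xi, yi in zip(x, y):
        grad_sum += 2 * (w * xi - yi) * xi   # d(loss)/dw for this row
        count += 1
        if count == batch_size:
            w -= lr * grad_sum / batch_size  # online mode would instead update after every row
            grad_sum, count = 0.0, 0
    return w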

Things to tweak in backpropagation

  • Learning rate - if too small, training is slow and can get stuck in local optima
  • Momentum - if too large, updates become erratic (both are set on the optimizer, as sketched below)
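
In Keras both of these knobs live on the optimizer; a minimal sketch (the values are placeholders and `model` is assumed to already exist):

from tensorflow.keras.optimizers import SGD

opt = SGD(learning_rate=0.01, momentum=0.9)  # classic gradient descent with momentum
model.compile(optimizer=opt, loss="mse")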

Optimizers

  • Classic batch gradient descent - takes the whole dataset and runs one gradient descent step on it. It is very slow, unlike online training, which allows us to update on examples on-the-fly.

  • Stochastic gradient descent - examples (or mini-batches) are picked randomly and fed into gradient descent.

Activations

ReLU, PReLU, tanh, sigmoid, softmax

Intuition behind using better optimizers.

A single learning rate is used for all weights across the network, but some neurons may learn faster than others. The learning rate can be decreased over time, or multiple learning rates can be used. The idea is to move away from a single global momentum and learning rate.

Usually, if a feature x is sparse (mostly 0) and is the input to a neuron, its gradient will be 0 for most inputs, so the weights for that neuron will not get enough updates. If feature x is sparse but important, the neuron representing it is not learning much. So we want to adapt the learning rate based on the frequency of the features.

So if mostly zeros are coming in > low frequency > needs a higher learning rate.

If mostly non-zeros are coming in > high frequency > needs a lower learning rate.

  • Adagrad - keeps a per-weight decaying learning rate that never increases. It adapts the learning rate to the parameters, performing smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason it is well suited for dealing with sparse data.

The problem is that the update decays to 0 fairly soon for the frequently updated weights, so there are no more updates after a certain point.

  • RMSprop - solves Adagrad's problem by dividing the learning rate by an exponentially decaying average of squared gradients.

  • Adadelta - can go in either direction.

  • ADAM (adaptive moment estimation) - introduced in 2014. Uses RMSprop's denominator decay along with a cumulative history of gradients (momentum).
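
In Keras these optimizers are drop-in choices at compile time; a minimal sketch (learning rates are placeholder values and `model` is assumed to exist):

from tensorflow.keras.optimizers import Adagrad, RMSprop, Adam

model.compile(optimizer=Adagrad(learning_rate=0.01), loss="mse")   # per-weight decaying rates
model.compile(optimizer=RMSprop(learning_rate=0.001), loss="mse")  # decaying average of squared gradients
model.compile(optimizer=Adam(learning_rate=0.001), loss="mse")     # RMSprop-style scaling + momentum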

Kfold and stratifiedKFold cross validation

For KFold-

from sklearn.model_selection import KFold

folds = KFold(5, shuffle=True, random_state=42)

for train, test in folds.split(x):
    x_train, y_train = x[train], y[train]
    x_test, y_test = x[test], y[test]

For StratifiedKFold-

from sklearn.model_selection import StratifiedKFold
                                        # need to put random state otherwise different results
folds = StratifiedKFold(5, shuffle=True, random_state=42)

for train, test in folds.split(x, y):
    x_train, y_train = x[train], y[train]
    x_test, y_test = x[test], y[test]

5-fold cross validation: 1 fold is for validation, the rest are for training, so 5 models for 5 folds. Ways to bring them together are as follows:

  • Choose the model with the lowest validation loss. If there is major variance between the scores, some validation fold probably has outliers.
  • Present new data to all models and average the predictions like an ensemble (see the sketch after this list).
  • Retrain a new model: do early stopping for each k-fold model, note the epochs each one needed, and train the new model for the average (or maximum) of those epoch counts.
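
A minimal sketch of the ensemble-averaging option, assuming `models` holds the 5 fitted fold models and `X_new` is the new data:

import numpy as np

preds = np.mean([m.predict(X_new) for m in models], axis=0)  # average the 5 models' predictions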

Regularizations

L1 and L2 regularizations

from tensorflow.keras.layers import Dense
from tensorflow.keras import regularizers

model.add(Dense(units, activation=activation, activity_regularizer=regularizers.l1(1e-4)))

Dropouts

Simplifies the model by randomly deactivating some neurons during training. These neurons keep their weights but do not fire (see the sketch below).
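
A minimal sketch of a dropout layer in Keras (the 0.5 rate and the layer size are placeholders):

from tensorflow.keras.layers import Dense, Dropout

model.add(Dense(128, activation="relu"))
model.add(Dropout(0.5))   # randomly drops 50% of this layer's activations during training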

Bootstrapping and benchmarking hyperparameters

Things that need tuning-

  • number of layers
  • number of units/neurons per layer
  • activation used
  • dropout percent
  • L1 and L2 values for each layer (also optimizer, class weights etc.)

Bootstrapping -

  • Random data picking with replacement.
  • The accuracy or RMSE is then averaged over all runs to get an idea of performance.
  • Early stopping is used to understand how many epochs are needed.
import numpy as np
from sklearn.model_selection import ShuffleSplit  # use StratifiedShuffleSplit for classification
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import mean_squared_error

SPLITS = 50
boot = ShuffleSplit(n_splits=SPLITS, test_size=0.1, random_state=42)

monitor = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)

# inside the loop over boot.split(x, y): fit the model with callbacks=[monitor], then record
#   epochs = monitor.stopped_epoch
#   score = np.sqrt(mean_squared_error(pred, y_test))

# track the mean and stddev of the scores and the epochs; they slowly converge as we approach split 50

CNN Layers

parameters - units / number of filters, filter size, stride, padding, activation

total weights = filter height * filter width * input depth * number of filters (plus one bias per filter)

Keras convnets need input of shape height * width * color depth (which is 1 for grayscale); see the sketch below.
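
A minimal sketch of a conv layer and its parameter count (the sizes here are placeholders):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1), padding="same",
                 activation="relu", input_shape=(28, 28, 1)))  # grayscale 28x28 input
# weights = 3 * 3 * 1 * 32 + 32 biases = 320 parameters (visible via model.summary())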

ResNet

Skip connections that allow layers to be skipped. This improves the trainability and predictive power of deeper NNs (see the sketch below).
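
A minimal sketch of a residual block in the functional API (filter counts are placeholders; it assumes the input already has 64 channels so the shapes match for the addition):

from tensorflow.keras.layers import Conv2D, Add, Activation

def residual_block(x):
    shortcut = x
    y = Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    y = Conv2D(64, (3, 3), padding="same")(y)
    y = Add()([shortcut, y])        # the skip connection
    return Activation("relu")(y)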

lr schedule code

Smaller learning rates need more epochs, since the model trains at a slower rate.

# learning rate can be reduced according to CIFAR competition after 
# 80, 120, 160, 180 epochs. the function can be used with a callback
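
A sketch of such a step-schedule function used with the LearningRateScheduler callback; the base rate and decay factors follow the common CIFAR-10 ResNet example, so treat the exact numbers as placeholders:

from tensorflow.keras.callbacks import LearningRateScheduler

def lr_schedule(epoch):
    lr = 1e-3
    if epoch > 180:
        lr *= 0.5e-3
    elif epoch > 160:
        lr *= 1e-3
    elif epoch > 120:
        lr *= 1e-2
    elif epoch > 80:
        lr *= 1e-1
    return lr

lr_callback = LearningRateScheduler(lr_schedule)
# model.fit(..., callbacks=[lr_callback], epochs=200)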

Building ensembles using keras and sklearn

from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

def keras_func():
    # build and compile the Keras model here
    ...

models = [RandomForestClassifier(n_estimators=10),
          KerasClassifier(build_fn=keras_func)]

blend_test = []
for i, model in enumerate(models):
    for train, test in k.split(X, y):   # k is a KFold/StratifiedKFold splitter defined earlier
        model.fit(X[train], y[train])
        blend_test.append(model.predict(X[test]))

# run logistic regression on blend_test and check accuracy against y_test

Hyperparameter tuning

Number of layers and Neuron counts

  • Activation (parameter also)
  • ActivityRegularization (parameter also)
  • Dense
  • Dropout
  • Flatten
  • Input (parameter also)
  • Lambda - similar to a Python lambda function, for applying a mapping
  • Reshape/Permute - change the shape of layer outputs (see the sketch after this list)
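
A minimal sketch of the Lambda and Reshape layers (the mapping and shapes are placeholders; Reshape here assumes the previous layer outputs 64 units):

from tensorflow.keras.layers import Lambda, Reshape

model.add(Lambda(lambda t: t * 2.0))   # apply an arbitrary element-wise mapping
model.add(Reshape((8, 8, 1)))          # reshape a 64-unit output into an 8x8x1 tensor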

Activation functions

  • linear - used for regression on output layer
  • softmax
  • tanh - used in lstms usually
  • sigmoid
  • hard_sigmoid - cheaper to compute
  • exponential
  • relu
  • elu - exponential linear unit; can produce negative outputs, often used in GANs.

Advanced activation functions

  • LeakyReLU - prevents dead ReLU units by allowing a small negative slope instead of zero.
  • PReLU - learns the alpha (slope) term that is fixed in LeakyReLU (see the sketch below).
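
A minimal sketch of using them as layers (the alpha value and layer sizes are placeholders):

from tensorflow.keras.layers import Dense, LeakyReLU, PReLU

model.add(Dense(64))
model.add(LeakyReLU(alpha=0.1))   # fixed small negative slope
model.add(Dense(64))
model.add(PReLU())                # negative slope is learnt during training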

Regularization

  • L1, L2 via activity_regularizer and kernel_regularizer
  • dropout

Batch Normalization

  • can be used to combat vanishing gradients, and also helps when the learning rate is so large that training blows up into NaNs

Training parameters

  • optimizers
  • batch size - number of rows passed through in one complete forward and backward pass
  • learning rate
  • epochs - number of times the network sees the entire data
  • iterations (not set in keras) - one complete forward and backward pass, i.e. one batch
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, PReLU

def evaluate_network(dropout, lr, neuronPct, neuronShrink):
    boot = StratifiedShuffleSplit(n_splits=2, test_size=0.2, random_state=42)

    mean_benchmark = []
    epochs_needed = []
    neuronCount = int(neuronPct * 50000)

    for train, test in boot.split(x, y):
        X_train, X_test = x[train], x[test]
        y_train, y_test = y[train], y[test]

        model = Sequential()
        count = neuronCount
        layer = 0
        while count > 20 and layer < 10:
            if layer == 0:
                model.add(Dense(count, input_dim=x.shape[1]))
            else:
                model.add(Dense(count))
            model.add(PReLU())                 # PReLU is added as a layer, not a string activation
            model.add(Dropout(dropout))
            count = int(count * neuronShrink)  # each layer shrinks relative to the previous one
            layer += 1

        model.add(Dense(1, activation="sigmoid"))
        # compile with the given lr, fit with early stopping, score on X_test / y_test,
        # append the score to mean_benchmark and the epoch count to epochs_needed
    # return the negative mean log loss so it can be maximized

Bayesian Hyperparameter Optimization

Parameters are the weights that backpropagation adjusts; hyperparameters, like the number of layers etc., need to be set by us.

Usually layers go in pyramid form, from large layers to smaller layers.

Nelder-Mead can also be used. Bayesian optimization is preferred because it uses past experience, somewhat like the multi-armed bandit problem.

from bayes_opt import BayesianOptimization

pbounds = {"dropout": (0.0, 0.499),
           "lr": (0.0, 0.1),
           "neuronPct": (0.01, 0.9),
           "neuronShrink": (0.1, 1)}

optimizer = BayesianOptimization(f=evaluate_network,
                                 pbounds=pbounds,
                                 verbose=2)

optimizer.maximize(init_points=10, n_iter=100)

LSTM

3 axes:

  • axis 1: training set elements (sequences); must be the same length as y
  • axis 2: members of the sequence (day 1, day 2, day 3, day 4)
  • axis 3: features in the data (like input neurons)

# converting to sequences

import numpy as np

def to_sequence(seq_size, obs):
    x = []
    y = []
    for i in range(len(obs) - seq_size - 1):
        window = obs[i:(i + seq_size)]
        after_window = obs[i + seq_size]
        window = [[v] for v in window]   # each timestep becomes a 1-feature vector
        x.append(window)
        y.append(after_window)
    return np.array(x), np.array(y)
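
A minimal sketch of an LSTM that consumes these sequences (the unit count and window length are placeholders):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

SEQ_SIZE = 10                                    # hypothetical window length
model = Sequential()
model.add(LSTM(64, input_shape=(SEQ_SIZE, 1)))   # (members of sequence, features)
model.add(Dense(1))                              # predict the value right after the window
model.compile(optimizer="adam", loss="mse")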

Transfer learning for computer vision

Use of MobileNet

It is used extensively because it is lightweight in terms of size and computation.

from tensorflow.keras.applications import MobileNet

model = MobileNet(weights="imagenet", include_top=True)
# if include_top is False we shear off the classification (output) layers for transfer learning

For retraining, new layers are added and existing layers are marked as trainable or non-trainable:

from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model

base_model = MobileNet(weights="imagenet", include_top=False)
base_model.summary()  # inspect which output layers are present and which were removed

x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation="relu")(x)   # unit counts here are arbitrary placeholders
x = Dense(1024, activation="relu")(x)
preds = Dense(3, activation="softmax")(x)

model = Model(inputs=base_model.input, outputs=preds)

for layer in model.layers[:20]:   # early (input-side) layers stay frozen
    layer.trainable = False
for layer in model.layers[20:]:   # the newly added layers are trainable
    layer.trainable = True

Keras can also read from a directory. Create folders named after the classes and put the images inside them with any filename.

from tensorflow.keras.preprocessing.image import ImageDataGenerator, img_to_array
from tensorflow.keras.applications.resnet50 import preprocess_input

train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
train_generator = train_datagen.flow_from_directory("Users/praths/Downloads",
                                                    target_size=(128, 128),
                                                    color_mode="rgb",
                                                    batch_size=1,
                                                    class_mode="categorical",
                                                    shuffle=True)

Transfer learning can also be used for getting vector embeddings for NLP and doing feature engineering.

YOLO Darknet for object detection

YOLO - you only look once. It uses a single CNN but a very complex output layer, and can detect multiple objects within the same frame. Essentially it resizes the image and runs a CNN on it, producing lots of bounding boxes with predictions. A threshold is then set: boxes above it are kept, the rest are thrown away.

YOLO is a combination of regression and classification.

The total number of bounding boxes, i.e. the number of output layer neurons, is always fixed.

The input and hidden layers are standard: CNN, max-pooling and dense layers. For each grid cell the output layer has B bounding boxes, each with x, y, w, h and a confidence score, plus C relative class probabilities. So the number of output neurons is S x S x (B x 5 + C), where S is the size of the YOLO grid overlaid on the image, B is the number of potential bounding rectangles per cell, 5 covers x, y, w, h and confidence, and C is the number of class labels.
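
A quick worked example using the numbers from the original YOLO paper (S = 7, B = 2, C = 20):

S, B, C = 7, 2, 20              # grid size, boxes per cell, classes
outputs = S * S * (B * 5 + C)   # 7 * 7 * (2*5 + 20) = 1470 output neurons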

It has trouble recognizing small, highly dense things, like a flock of birds on the horizon. DarkNet - the original implementation of YOLO, written in C. DarkFlow - an implementation in Python using TensorFlow.

The output is the coordinates, and a lot can be done with them, e.g. in real time, if the signal turns green, move the car.

GANs

Generative adversarial network: 2 NNs working together to produce highly realistic images or other forms of data. To spot a fake image, look at the background; it is usually dreamy or surreal. Also look at the shoulders, which GANs cannot predict easily. How does a GAN work?

The generator takes random data and puts out images (or other random data). The discriminator takes in pictures and tries to detect whether they are real. Both fight against each other to make each other better, which is what makes them adversaries. At the end one of them is kept and the other is discarded; for a face generator, the generator is kept and the discriminator is discarded.

Discriminator

Its job is to detect whether the image coming into the NN is real or fake. Usually the input is height x width x color depth.

Generator

Its job is to take a random seed, e.g. [12, 2, 55, 3], as input (a random high-dimensional array) and give out a random face. While training a GAN use a specific distribution, and while generating faces use seed values from that same distribution.

Training

While training, only one of the NNs' weights must be modified. So while training the generator, the discriminator's weights must not be updated; that would be cheating.

Training the generator - take random seeds and send them to the generator, which generates random faces. Send those to the discriminator and get y^. The target y is always 1. Run backprop based on y and y^ and adjust the weights in the generator only. Note that the generator never sees the training set and y is always 1.

Training the discriminator - now we have a balanced y. The training data is used here: it is given as input to the discriminator along with the fakes from the generator, backprop is calculated w.r.t. y and y^, and only the discriminator's weights are adjusted.

import tensorflow as tf

tf.GradientTape()  # used for automatic differentiation during backprop
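
A minimal sketch of one GAN training step with tf.GradientTape, assuming `generator`, `discriminator` and their optimizers are already built (all of these names are placeholders):

import tensorflow as tf

cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def train_step(real_images, batch_size=32, noise_dim=100):
    noise = tf.random.normal([batch_size, noise_dim])
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        fake_images = generator(noise, training=True)
        real_out = discriminator(real_images, training=True)
        fake_out = discriminator(fake_images, training=True)
        # generator wants the discriminator to call its fakes real (y = 1)
        gen_loss = cross_entropy(tf.ones_like(fake_out), fake_out)
        # discriminator wants real -> 1 and fake -> 0
        disc_loss = (cross_entropy(tf.ones_like(real_out), real_out) +
                     cross_entropy(tf.zeros_like(fake_out), fake_out))
    # each tape only updates its own network's weights, so neither side "cheats"
    gen_grads = gen_tape.gradient(gen_loss, generator.trainable_variables)
    disc_grads = disc_tape.gradient(disc_loss, discriminator.trainable_variables)
    gen_optimizer.apply_gradients(zip(gen_grads, generator.trainable_variables))
    disc_optimizer.apply_gradients(zip(disc_grads, discriminator.trainable_variables))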

NVIDIA StyleGAN2

Transfer learning. It comes with pre-trained NNs and can generate faces, cars, etc.

GANs for semi-supervised training

Usually for face generation we discard the discriminator, but for semi-supervised learning we discard the generator. E.g. the discriminator learns to tell the difference between fake and real medical records; along with this, the discriminator can output 4 different real classes plus a separate "fake" class. Another example is Street View House Numbers - data from addresses taken from images of the sides of buildings. The GAN could classify digit types even without labels for all of them.

GANs can be used for upscaling old images to a better resolution, for things like DeepFakes, and for old video game upscaling too.