Gradient Descent
Gradient descent is an optimization algorithm that helps us find the parameters of a machine learning model.
We have learned about Regression and its error/loss function, and we were trying to find the parameters theta for which this loss is at its minimum.
Gradient descent finds this theta iteratively: it repeatedly applies the update rule theta := theta - alpha * dJ(theta)/d(theta), where J is the loss and alpha is the learning rate, until convergence is achieved.
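For instance, a minimal NumPy sketch of gradient descent for a linear-regression model with a mean-squared-error loss (the learning rate and iteration count are illustrative choices):

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Minimize mean squared error for a linear model y = X @ theta."""
    theta = np.zeros(X.shape[1])                 # initial guess for the parameters
    for _ in range(n_iters):
        preds = X @ theta                        # current predictions
        grad = (2 / len(y)) * X.T @ (preds - y)  # dJ/dtheta for the MSE loss
        theta -= lr * grad                       # move theta against the gradient
    return theta
```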
Fully connected neural networks - every neuron in a layer is connected to every neuron in the next layer.
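A small sketch of what full connectivity means in code (the layer sizes here are arbitrary): a fully connected layer is a matrix-vector product, so every output neuron depends on every input neuron.

```python
import numpy as np

x = np.random.rand(4)      # 4 neurons in the current layer
W = np.random.rand(3, 4)   # one weight for every (output, input) pair
b = np.random.rand(3)      # one bias per output neuron

next_layer = W @ x + b     # each of the 3 outputs depends on all 4 inputs
```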
Training Of A Neural Network
In the training stage of the neural network, weights are assigned to each connection between neurons. These weights are learnable parameters that are updated to find the optimal values.
Training a neural network consists of two main steps:
Forward Propagation
Forward propagation is how neural networks make predictions. Input data is “forward propagated” through the network layer by layer to the output layer which makes a prediction.
Backpropagation
In backpropagation, we propagate through the neural network backward, i.e., from the output layer to the input layer, and update the weights and biases of the neural network.
Let's understand this in detail with the help of an example.
Let’s consider a neural network with “N” inputs and a single neuron in the hidden layer:
Step 1: Forward Propagation - w*x + b
Step A:
In forward propagation, the data points from the input layer are propagated to a single neuron, where each input is multiplied by its respective weight and the results are summed together. Each neuron also has an additional term called the bias. The sum of the bias term and the linear combination of inputs and weights is the input to the single neuron.
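A minimal sketch of this step in NumPy (the values of x, w, and b are made up for illustration):

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])   # N input values
w = np.array([0.8, 0.1, -0.4])   # one weight per input
b = 0.25                         # bias term for the neuron

z = np.dot(w, x) + b             # weighted sum plus bias: the input to the neuron
```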
Step B:
In this step, we apply a nonlinear function to this linear combination. The functions applied to these linear combinations are known as Activation Functions (e.g. sigmoid, tanh, ReLU, softmax). Activation Functions introduce nonlinearity into our Neural Network: a network built only from linear functions cannot learn complex patterns in data, hence we use non-linear activation functions.
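As a sketch, a few of these activation functions applied to the weighted sum z from Step A (the value of z is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes z into (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes z into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # zero for negative z, identity otherwise

z = 0.7                               # weighted sum from Step A
a = sigmoid(z)                        # the neuron's activation (its output)
```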
Step 2: Calculate the Loss Function
After getting the output from forward propagation, we calculate the loss using the loss function. The weights and biases are updated in such a way that the loss function is minimized. Different loss functions are used depending on the nature of the problem: for regression, we usually use mean squared error, and for classification we use cross-entropy.
The loss function (or error) is computed for a single training example, while the cost function is computed over the entire training set (typically as the average of the per-example losses).
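A small illustration of both loss functions, with made-up predictions and labels; averaging the per-example losses gives the cost:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # squared error per example; the mean over the set is the cost
    return (y_true - y_pred) ** 2

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    # binary cross-entropy per example (y_true is 0 or 1, y_pred in (0, 1))
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])

cost = np.mean(cross_entropy_loss(y_true, y_pred))  # cost = average loss
```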
Step 3: Backpropagation
We try to trace the error or cost term back to the weights of our Neural Network. The way to do it is to take the derivative of the cost with respect to a particular weight and then shift the value of that weight in the direction that decreases the cost, i.e., against the gradient.
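For the single sigmoid neuron with a squared-error loss used in this example, the chain rule gives the following gradients (a hand-derived sketch with illustrative values, not a general backpropagation implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.25
y = 1.0                              # true label
lr = 0.1                             # learning rate

# forward pass
z = np.dot(w, x) + b
a = sigmoid(z)                       # prediction
loss = (a - y) ** 2                  # squared-error loss

# backward pass: chain rule from the loss back to the parameters
dloss_da = 2 * (a - y)
da_dz = a * (1 - a)                  # derivative of the sigmoid
dloss_dw = dloss_da * da_dz * x      # dz/dw = x
dloss_db = dloss_da * da_dz          # dz/db = 1

# shift the weights against the gradient
w -= lr * dloss_dw
b -= lr * dloss_db
```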
The algorithms used to update the weights and biases are known as Optimizers.
A few well-known optimizers are gradient descent, stochastic gradient descent (SGD), mini-batch SGD, etc.
Step 4: Repeat Forward Propagation and Backward Propagation until the cost function is minimized.
We repeat Forward Propagation and Backward Propagation until the cost/objective function is minimized.
A single iteration of forward and backward propagation works as follows. In forward propagation, we first calculate the value of each node from the input layer using the activation functions, then make predictions at the output layer and compute the error/loss function from the predicted and the actual labels. In backward propagation, the weights and biases are updated using derivatives to minimize the loss function.
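Putting the steps together, a minimal training loop for the single-neuron example might look like this (the data, learning rate, and number of iterations are all illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: 4 examples with N = 3 inputs each (purely illustrative)
X = np.array([[0.5, -1.2, 3.0],
              [1.0,  0.3, -0.5],
              [-0.7, 2.0, 1.5],
              [0.2, -0.3, 0.8]])
y = np.array([1.0, 0.0, 1.0, 0.0])

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1

for epoch in range(1000):
    # forward propagation
    z = X @ w + b
    a = sigmoid(z)
    cost = np.mean((a - y) ** 2)         # mean squared error over the set

    # backward propagation (gradients averaged over all examples)
    dcost_da = 2 * (a - y) / len(y)
    da_dz = a * (1 - a)
    dcost_dw = X.T @ (dcost_da * da_dz)
    dcost_db = np.sum(dcost_da * da_dz)

    # update step
    w -= lr * dcost_dw
    b -= lr * dcost_db
```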
Convolutional Neural Network
Convolutional neural networks are used in image processing and image classification.
They detect features like edges and shapes using convolution operations with different filters.
Parameters to tune:
- Padding - adds border pixels to prevent the reduction in image size after convolution
- Stride - the pixel shift used in the convolution operation; a larger stride reduces the size of the output
- Pooling - taking the average or max over a 2x2 or larger region to reduce the size of the feature map
The resulting feature maps are then flattened into a vector and fed into the fully connected layer (see the sketch below).
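A rough NumPy sketch of these operations on a toy image (the filter, image size, and pooling size are illustrative; real CNNs learn the filter values during training):

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide a filter over a 2-D image: the basic convolution operation."""
    if padding > 0:
        image = np.pad(image, padding)            # zero-padding around the border
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1      # output height
    ow = (image.shape[1] - kw) // stride + 1      # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)    # elementwise multiply and sum
    return out

def max_pool(feature_map, size=2):
    """Take the max over non-overlapping size x size regions."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[i * size:(i + 1) * size,
                                    j * size:(j + 1) * size].max()
    return out

image = np.random.rand(8, 8)                      # toy grayscale image
edge_filter = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])           # simple vertical-edge detector

features = conv2d(image, edge_filter, stride=1, padding=1)  # padding keeps the 8x8 size
pooled = max_pool(features, size=2)               # 4x4 after 2x2 max pooling
flat = pooled.flatten()                           # vector fed to the fully connected layer
```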