Gradient Descent is one of the most commonly used optimization algorithms for training machine learning models; it works by minimizing the error between the actual and predicted results. Gradient descent is also the core algorithm used to train neural networks.
Gradient Descent
1. What is Gradient Descent?
Gradient descent is a popular optimization algorithm used in machine learning and other areas of mathematical optimization. The main goal of this algorithm is to minimize a function by iteratively adjusting its input parameters.
The basic idea behind this is to compute the gradient of the objective function with respect to the input parameters and then update the parameters in the opposite direction of the gradient. This process is repeated until a local minimum of the objective function is reached.
This algorithm can be used for both convex and non-convex optimization problems. However, the convergence rate and final solution quality can vary depending on the problem’s properties and the algorithm’s hyperparameters.
2. Equation of Gradient Descent
The basic equation for gradient descent is as follows:
# Equation of Gradient Descent
theta = theta - alpha * gradient
where theta is the vector of parameters being optimized, alpha is the learning rate, and gradient is the gradient vector of the objective function evaluated at the current value of theta.
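As a minimal illustration of this update rule, the sketch below applies it to the one-dimensional function f(theta) = theta^2, whose gradient is 2 * theta (the function, starting point, and learning rate here are chosen purely for demonstration):
# Minimal example: minimize f(theta) = theta**2, whose gradient is 2*theta
theta = 5.0                            # initial guess
alpha = 0.1                            # learning rate
for _ in range(50):
    gradient = 2 * theta               # gradient of the objective at theta
    theta = theta - alpha * gradient   # step in the opposite direction of the gradient
print(theta)                           # close to 0, the minimum of f
Each iteration moves theta a little closer to 0; a smaller alpha would simply take more iterations to get there.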
3. What is a Cost Function?
A cost function is a mathematical function that measures the difference between the predicted values and the actual values of a machine learning model. It is used to evaluate how well the model is performing on the training data.
The equation of the cost function depends on the specific problem being solved and the type of machine learning algorithm being used. In general, the cost function is a function of the parameters of the model, and its value is a measure of how well the model is performing. The goal of machine learning is to find the set of parameters that results in the lowest possible cost, which is done using optimization algorithms such as gradient descent.
3.1 Examples of Cost Functions
3.1.1 Mean Squared Error (MSE) Cost Function
The MSE cost function is used in regression problems to measure the average squared difference between the predicted values and actual values. The equation for MSE is:
# Mean Squared Error (MSE) Cost Function
J(w,b) = (1/(2m)) * Σ(y_predicted - y_actual)^2
Where,
J(w,b) is the cost function
w and b are the parameters of the model
m is the number of training examples
Σ is the sum over all the training examples
y_predicted is the predicted value
y_actual is the actual value
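As a small illustration, a NumPy version of this cost for a linear model might look as follows (the data values and parameters below are made up purely for demonstration):
# Mean Squared Error cost for a linear model y = w*x + b
import numpy as np

def mse_cost(w, b, x, y_actual):
    m = len(x)                           # number of training examples
    y_predicted = w * x + b              # model predictions
    return (1 / (2 * m)) * np.sum((y_predicted - y_actual) ** 2)

x = np.array([1.0, 2.0, 3.0])            # hypothetical feature values
y = np.array([2.0, 4.0, 6.0])            # hypothetical actual values
print(mse_cost(2.0, 0.0, x, y))          # 0.0 for a perfect fit
print(mse_cost(1.0, 0.0, x, y))          # larger cost for a worse fit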
3.1.2 Cross-Entropy Cost Function
The cross-entropy cost function is used in classification problems to measure the difference between the predicted probabilities and the actual class labels. The equation for binary cross-entropy is:
# Cross-Entropy Cost Function
J(w,b) = (-1/m) * Σ(y_actual * log(y_predicted) + (1-y_actual) * log(1-y_predicted))
where,
J(w,b) is the cost function
w and b are the parameters of the model
m is the number of training examples
Σ is the sum over all the training examples
y_predicted is the predicted probability
y_actual is the actual label (0 or 1)
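As a small illustration, a NumPy version of this cost for binary classification might look as follows (the probabilities and labels below are made up purely for demonstration):
# Binary cross-entropy cost
import numpy as np

def cross_entropy_cost(y_predicted, y_actual):
    m = len(y_actual)                     # number of training examples
    return (-1 / m) * np.sum(y_actual * np.log(y_predicted)
                             + (1 - y_actual) * np.log(1 - y_predicted))

y_pred = np.array([0.9, 0.2, 0.8])        # hypothetical predicted probabilities
y_true = np.array([1, 0, 1])              # actual class labels (0 or 1)
print(cross_entropy_cost(y_pred, y_true)) # small cost: predictions match the labels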
4. How does Gradient Descent work?
Gradient Descent is an optimization algorithm used in machine learning to find the set of parameters that minimizes a cost function. The basic idea behind this algorithm is to iteratively adjust the parameters of the model in the direction of the steepest descent of the cost function.
The algorithm works by taking the derivative of the cost function with respect to each parameter, which gives the direction of the steepest ascent. Then, the algorithm adjusts the parameters in the opposite direction of the gradient, which results in a decrease in the cost function. This process is repeated until the cost function reaches a minimum.
4.1 Steps of the Gradient Descent algorithm
4.1.1 Data
Let’s consider a simple linear regression problem where we want to predict the price of a house based on its size.
Size (sq. ft.)    Price (1000s)
2104              460
1416              232
1534              315
852               178
1472              332
1380              279
1494              310
1940              540
2000              489
1890              427
House Price Data Set
We want to fit a linear model of the form y = mx + b to this data, where x is the size of the house and y is the price. We can use gradient descent to find the optimal values of m and b that minimize the mean squared error cost function.
4.1.2 Initialize the parameters
The first step is to initialize the parameters of the model to some starting values (often chosen at random). These parameters will be adjusted during the optimization process to minimize the cost function.
Here we simply set both m and b to 0.
import numpy as np
# Load the data
data = np.array([[2104, 460], [1416, 232], [1534, 315], [852, 178], [1472, 332],
[1380, 279], [1494, 310], [1940, 540], [2000, 489], [1890, 427]])
# Extract the features and labels
X = data[:, 0] # size
y = data[:, 1] # price
# Initialize the parameters
m = 0
b = 0
4.1.3 Calculate the cost function
The next step is to calculate the value of the cost function using the current values of the parameters.
We calculate the value of the cost function using the current values of m and b. The cost function for linear regression is the mean squared error, which is defined as:
# Calculate cost function
def mse(y_pred, y_actual):
    return np.mean((y_pred - y_actual)**2)
y_pred = m*X + b
cost = mse(y_pred, y)
4.1.4 Calculate the gradients
We calculate the derivatives of the cost function with respect to m and b. The gradients for linear regression are:
# Calculate the gradients
dm = (2/len(X)) * np.sum((y_pred - y) * X)
db = (2/len(X)) * np.sum(y_pred - y)
4.1.5 Update the parameters
We update the parameters of the model by subtracting the gradient multiplied by the learning rate. Let’s set the learning rate to 0.01.
# Update parameters
lr = 0.01
m -= lr * dm
b -= lr * db
4.1.6 Repeat
We repeat the previous three steps (calculating the cost, calculating the gradients, and updating the parameters) until the cost function reaches a minimum or a predetermined number of iterations has been reached. Let's set the maximum number of iterations to 100.
for i in range(100):
    # Calculate the predictions
    y_pred = m*X + b
    # Calculate the cost
    cost = mse(y_pred, y)
    # Calculate the gradients
    dm = (2/len(X)) * np.sum((y_pred - y) * X)
    db = (2/len(X)) * np.sum(y_pred - y)
    # Update the parameters
    m -= lr * dm
    b -= lr * db
    # Print the progress every 10 iterations
    if i % 10 == 0:
        print(f"Iteration {i}: m={m:.2f}, b={b:.2f}, cost={cost:.2f}")
This yields the output below.
# OUTPUT:
Iteration 0: m=86.87, b=4.82, cost=315350.60
Iteration 10: m=79.01, b=15.09, cost=232091.29
Iteration 20: m=71.50, b=24.85, cost=170343.57
Iteration 30: m=64.33, b=34.15, cost=125564.03
Iteration 40: m=57.49, b=43.04, cost=92660.94
Iteration 50: m=50.98, b=51.54, cost=68277.57
Iteration 60: m=44.78, b=59.68, cost=50435.26
Iteration 70: m=38.89, b=67.49, cost=37412.19
Iteration 80: m=33.29, b=74.98, cost=27940.79
Iteration 90: m=28.00, b=82.18, cost=21064.45
As we can see from the output, the cost function decreases with each iteration, indicating that the values of m and b are getting closer to the optimal values that minimize the cost function. After 100 iterations, we can use the final values of m and b to make predictions on new data.
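Continuing from the loop above, making a prediction for a new house is just a matter of plugging its size into the fitted line (the 1600 sq. ft. value below is purely illustrative):
# Predict the price (in 1000s) of a hypothetical 1600 sq. ft. house
new_size = 1600
predicted_price = m * new_size + b
print(f"Predicted price: {predicted_price:.1f} (in 1000s)")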
I hope this example helps you understand how Gradient Descent works.
5. Types of Gradient Descent
Based on how much of the training data is used to compute each parameter update, gradient descent can be divided into batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Let's understand these different types of gradient descent:
5.1 Batch Gradient Descent
In batch gradient descent, the algorithm updates the model parameters (weights and biases) based on the average of the gradients over the entire training dataset. It calculates the gradients for the whole dataset and then performs a single parameter update. The worked example in section 4 is batch gradient descent, since every update uses all of the training examples. This approach is computationally expensive for large datasets, but each update follows the exact gradient, so the path toward the minimum is smooth and stable.
5.1.1 Advantages Of Batch Gradient Descent
Typically needs fewer parameter updates than stochastic gradient descent to reach a given level of accuracy.
More stable because it averages out the noise in the gradients.
Suitable for small to medium-sized datasets where the entire dataset can fit into memory.
5.2 Stochastic Gradient Descent
In stochastic gradient descent (SGD), the algorithm updates the model parameters for each training example: it calculates the gradient for a single example and immediately updates the parameters. The stochastic nature of this algorithm introduces some randomness, which makes it less likely to get stuck in a shallow local minimum. However, the noisy updates cause the cost to fluctuate from step to step, so it can take longer to settle near the minimum than batch gradient descent.
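A minimal sketch of one SGD pass (epoch) over the house-price data might look like this; it assumes np, X, y, m, b, and lr are already defined as in section 4, and visiting the examples in random order is a common but optional detail:
# One epoch of stochastic gradient descent: update after every single example
indices = np.random.permutation(len(X))   # visit the examples in random order
for j in indices:
    y_pred_j = m * X[j] + b               # prediction for one example
    error_j = y_pred_j - y[j]
    dm = 2 * error_j * X[j]               # gradient from this single example
    db = 2 * error_j
    m -= lr * dm                          # in practice, per-example updates often need
    b -= lr * db                          # a smaller learning rate or scaled features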
5.2.1 Advantages Of Stochastic Gradient Descent
Each update is computationally cheap because it uses only a single training example.
Can converge faster than batch gradient descent on large datasets because it makes many parameter updates per pass over the data.
Suitable for large datasets where the entire dataset cannot fit into memory.
5.3 Mini-Batch Gradient Descent
Mini-batch gradient descent is a compromise between batch and stochastic gradient descent. It updates the model parameters based on a small random subset of the training dataset. It is less computationally expensive than batch and less noisy than stochastic gradient descent. It’s a commonly used algorithm in deep learning.
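A minimal sketch of one mini-batch pass over the same data might look like this; again it assumes np, X, y, m, b, and lr from section 4, and the batch size of 4 is an arbitrary illustrative choice:
# One epoch of mini-batch gradient descent: update after each small random batch
batch_size = 4
indices = np.random.permutation(len(X))   # shuffle, then split into mini-batches
for start in range(0, len(X), batch_size):
    batch = indices[start:start + batch_size]
    y_pred_b = m * X[batch] + b
    error_b = y_pred_b - y[batch]
    dm = (2 / len(batch)) * np.sum(error_b * X[batch])
    db = (2 / len(batch)) * np.sum(error_b)
    m -= lr * dm
    b -= lr * db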
5.3.1 Advantages Of Mini-Batch Gradient Descent
A compromise between batch gradient descent and stochastic gradient descent.
More stable than stochastic gradient descent because it reduces the variance of the gradient updates.
Can take advantage of parallel processing, making it computationally efficient.
Suitable for large datasets where the entire dataset cannot fit into memory and can be used in combination with hardware accelerators like GPUs or TPUs.
Each type of gradient descent has its own advantages and disadvantages, and the choice of algorithm depends on the specific problem, the size of the dataset, and the available hardware resources.
6. Challenges with the Gradient Descent
Here are some of the challenges associated with Gradient Descent:
Local Optima: Gradient Descent can converge to a local optimum instead of the global optimum, depending on the initial starting point and the shape of the cost function.
Learning Rate: The learning rate determines the size of the steps taken during each iteration, and choosing an appropriate learning rate is important for the algorithm to converge to a solution. If the learning rate is too small, the algorithm may take a long time to converge, and if it is too large, the algorithm may not converge at all or may oscillate around the solution (see the short demonstration after this list).
Overfitting: A model trained with Gradient Descent may overfit the training data, which can lead to poor generalization performance on unseen data.
Feature Scaling: If the features in the dataset have different scales, Gradient Descent may take longer to converge or may not converge at all. Therefore, it is important to scale the features appropriately.
Memory Constraints: In the case of large datasets, it may not be possible to store the entire dataset in memory, which can make it challenging to compute the gradient for the entire dataset.
Local Minima: Sometimes, Gradient Descent can get stuck in local minima, which can lead to suboptimal solutions.
Gradient Vanishing/Exploding: The gradients may become very small or very large during the training process, which can make it difficult for the algorithm to converge.
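As a small demonstration of the learning-rate trade-off mentioned above, the sketch below runs gradient descent on the simple function f(x) = x^2 with two different learning rates (the values 0.1 and 1.1 are chosen only for illustration):
# Effect of the learning rate on gradient descent for f(x) = x**2
def run_gd(lr, steps=20):
    x = 5.0                     # starting point
    for _ in range(steps):
        x = x - lr * (2 * x)    # gradient of x**2 is 2*x
    return x

print(run_gd(0.1))   # small learning rate: x ends up close to the minimum at 0
print(run_gd(1.1))   # too large: each step overshoots and x diverges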
These challenges should be carefully considered when applying Gradient Descent to a problem, and appropriate measures should be taken to mitigate them.
7. Conclusion
In conclusion, gradient descent is a widely used optimization algorithm for machine learning and other mathematical problems. It works by iteratively updating the parameters of a model to minimize a cost function. The algorithm can be applied to many types of models and cost functions, including linear regression, logistic regression, and neural networks.