Introduction

Gradient descent is an iterative optimization algorithm that adjusts the variable vector $$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \dots \\ x_n \end{pmatrix}$$ to find the local minimum of a differentiable function $$f(\mathbf{x})$$. Starting from an initial guess $$\mathbf{x}_0$$ of vector $$\mathbf{x}$$, it uses the gradient of the function to make steps towards the minimum, controlled by the learning rate $$r$$. The process continues until the changes in the values of $$\mathbf{x}$$ are sufficiently small, the gradient magnitude is close to zero, or another stopping criterion is met.

The core idea of gradient descent is to iteratively adjust the variables to find the minimum of a function. This adjustment is guided by the update formula: $\mathbf{x}_{k+1} = \mathbf{x}_k - r \nabla f(\mathbf{x}_k)$ where $$\mathbf{x}$$ is a vector of variables, $$r$$ such that $$r \gt 0$$ is the learning rate, and $$\nabla f(\mathbf{x}_k)$$ is the gradient of the function $$f$$ at $$\mathbf{x} = \mathbf{x}_k$$ such that: $\nabla f(\mathbf{x}) = \begin{pmatrix} \dfrac{\partial f}{\partial x_1} \\ \dfrac{\partial f}{\partial x_2} \\ \dots \\ \dfrac{\partial f}{\partial x_n}\\ \end{pmatrix}$
The update formula comes from the concept that the gradient $$\nabla f(\mathbf{x})$$ points in the direction of the steepest increase of the function $$f$$. To minimize $$f$$, we move in the opposite direction of the gradient. The learning rate $$r$$ determines the size of the steps we take in this direction.

1. Initialization:
- Begin with an initial guess for the vector of variables $$\mathbf{x}$$. Let this initial value be $$\mathbf{x}_0$$.
- Choose a learning rate $$r$$, which controls the size of the steps taken towards the minimum.
- For the function $$f(\mathbf{x})$$, compute the gradient $$\nabla f(\mathbf{x})$$.
- Evaluate this gradient at the current value of $$\mathbf{x} = \mathbf{x}_k$$. Denote the gradient at iteration $$k$$ as $$\nabla f(\mathbf{x}_k)$$.
- Use the update formula to adjust the values of $$\mathbf{x}$$: $\mathbf{x}_{k+1} = \mathbf{x}_k - r \nabla f(\mathbf{x}_k)$ - Here, $$\mathbf{x}_{k+1}$$ is the updated value of $$\mathbf{x}$$ after one iteration.
- Repeat the process of computing the gradient and updating the values of $$\mathbf{x}$$ until a stopping criterion is met. The stopping criterion can be:
- A maximum number of iterations. - The change in the values of $$\mathbf{x}$$ between iterations is smaller than a predefined threshold.
- The magnitude of the gradient $$\|\nabla f(\mathbf{x}_k)\|$$ is smaller than a predefined threshold.
- Once the stopping criterion is satisfied, the value of $$\mathbf{x}$$ should be close to the value that minimizes the function $$f(\mathbf{x})$$.