Loss Iteration
To find a model that minimizes loss, we can use iterative learning. That is: take a guess at a weight and bias, calculate the loss, then try another guess and see whether the loss is greater or smaller. We continue making guesses in directions we expect to lead to iteratively smaller loss, until we reach convergence: a point where the loss stops changing, or changes by negligibly small amounts.
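Below is a minimal sketch of this guess-and-check loop. The toy dataset, the use of mean squared error as the loss, and the "nudge each parameter and keep the better guess" strategy are all assumptions chosen purely for illustration.

```python
# Guess-and-check loss iteration on a toy dataset (all values assumed).

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

def mse_loss(weight, bias):
    """Mean squared error of the line y = weight * x + bias on the toy data."""
    return sum((weight * x + bias - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

weight, bias, step = 0.0, 0.0, 0.25

for _ in range(10_000):
    current = mse_loss(weight, bias)
    # Guess again in each direction and see whether the loss gets smaller.
    candidates = [
        (mse_loss(weight + step, bias), weight + step, bias),
        (mse_loss(weight - step, bias), weight - step, bias),
        (mse_loss(weight, bias + step), weight, bias + step),
        (mse_loss(weight, bias - step), weight, bias - step),
    ]
    best_loss, best_weight, best_bias = min(candidates)
    if best_loss < current:
        weight, bias = best_weight, best_bias   # keep the better guess
    else:
        step /= 2                               # try finer-grained guesses
        if step < 1e-6:                         # loss changes are negligible
            break                               # -> convergence

print(f"weight={weight:.3f}, bias={bias:.3f}, loss={mse_loss(weight, bias):.6f}")
```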
Gradient Descent
For linear regression, if we plot the loss against the value of each weight guess, we should always see a convex, “bowl” shape in the resulting graph. Think about it: if the weight (the slope of the fitted line) is too high, loss will be high; if the weight is too low, loss will also be high; and loss will be lowest somewhere between those “too high” and “too low” values. That is the bowl shape.
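The short sweep below makes that shape concrete: it evaluates the loss at a range of weight guesses with the bias held fixed, and prints a crude text bar for each. The toy data and the fixed bias value of 1.0 are assumptions for illustration.

```python
# Sweep the weight (bias held fixed at an assumed value of 1.0) and print the
# loss at each guess; the bars trace out the convex "bowl".

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

def mse_loss(weight, bias):
    return sum((weight * x + bias - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

for w in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]:
    loss = mse_loss(w, 1.0)
    print(f"weight={w:3.1f}  loss={loss:6.2f}  " + "#" * round(loss))
```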
Trying to find the weight that minimizes loss by simply choosing values and guessing is hardly efficient. Instead, we can use gradient descent. That is: we can calculate the gradient (slope) of the loss at our current guess. If the slope is negative, we know we need a higher weight. If the slope is positive, we know we need a lower weight. We can continue iterating until we reach a slope of zero (that is, the bottom of the “bowl”, where loss is minimal).
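Here is a minimal gradient descent sketch for the same kind of toy linear regression. The starting values, learning rate, and stopping threshold are assumptions chosen for illustration; the gradients are the standard derivatives of mean squared error with respect to the weight and the bias.

```python
# Gradient descent on the toy dataset: step each parameter opposite to the
# slope of the loss until the slope is (nearly) zero.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
n = len(xs)

weight, bias = 0.0, 0.0
learning_rate = 0.1        # assumed value; see the next section

for step in range(10_000):
    errors = [weight * x + bias - y for x, y in zip(xs, ys)]
    # Gradients of mean squared error with respect to weight and bias.
    grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
    grad_b = (2 / n) * sum(errors)

    # A negative slope means the weight should go up; a positive slope
    # means it should go down -- so we move against the gradient.
    weight -= learning_rate * grad_w
    bias -= learning_rate * grad_b

    if abs(grad_w) < 1e-6 and abs(grad_b) < 1e-6:   # bottom of the bowl
        break

print(f"stopped after {step + 1} steps: weight={weight:.3f}, bias={bias:.3f}")
```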
Learning Rates
When we are iterating toward convergence, we need to decide how big the step should be between guesses. Too small, and it will take far too many steps to reach convergence. Too large, and we’ll overshoot the convergence point on each iteration, bouncing between “too high” and “too low” for far too many steps.
Gradient descent algorithms multiply the gradient by the learning rate to determine the next point. For example, if the gradient magnitude is 5 and the learning rate is 0.1, the next point will be 0.5 away from the current point.
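The sketch below reproduces that arithmetic and then shows how the too-small and too-large behaviors described above play out over a whole run. The one-parameter loss L(w) = (w - 3)^2, the starting point, and the three learning-rate values are assumptions chosen to make the numbers easy to check.

```python
# The step taken is the gradient scaled by the learning rate.

def gradient(w):
    return 2 * (w - 3)           # dL/dw for the assumed loss L(w) = (w - 3)**2

w = 0.5                          # gradient magnitude here is |2 * (0.5 - 3)| = 5
learning_rate = 0.1
print(learning_rate * abs(gradient(w)))   # 0.5 -- the next point is 0.5 away

# Too small a rate barely moves; too large a rate overshoots and diverges.
for lr in (0.001, 0.1, 1.1):
    w = 0.5
    for _ in range(50):
        w -= lr * gradient(w)
    print(f"lr={lr:<5}  w after 50 steps: {w:10.3f}   (minimum is at w = 3)")
```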
Data scientists spend much of their time adjusting the learning rate (and other hyperparameters) to make training efficient.