Hello, and welcome to this blog on Gradient Descent.
Derivative:
As the name suggests, a derivative is a function that is derived from another function.
Let's start with an example. Imagine you are driving on the highway. As time goes by, you mark your position along the highway, filling a table of values of position as a function of time. If your speed is 60 miles an hour, every minute your position will increase by one mile.
Let's define the function x(t) to indicate your position as a function of time. The derivative of this function is the rate of change in position with respect to time; in this example, that rate is exactly your speed.

At each point along the curve, the derivative is the slope of the curve itself. We can calculate an approximate value of the slope with the method of finite differences. The derivative is negative where the original curve slopes downhill, and positive where it slopes uphill.
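To make the finite-difference idea concrete, here is a minimal Python sketch. The function x(t) and the step size dt are illustrative assumptions: we model the 60-miles-an-hour drive as one mile per minute and estimate the slope from two nearby points.

```python
def x(t):
    # Position in miles after t minutes, driving at 60 mph (1 mile per minute).
    return 1.0 * t

def derivative(f, t, dt=1e-5):
    # Central finite difference: the slope of the secant line
    # through two points very close to t.
    return (f(t + dt) - f(t - dt)) / (2 * dt)

print(derivative(x, 10.0))  # ~1.0 mile per minute, as expected
```

A downhill stretch of a curve would give a negative value here, and an uphill stretch a positive one.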
Gradient:
When our function has more than one input, we need to specify which variable we are differentiating with respect to. For example, let's say we are measuring our elevation on a mountain as a function of our position. Our GPS position is defined by two variables: longitude and latitude.
elevation = f(longitude, latitude)
The elevation therefore depends on two variables. Let's change the variable names to shorter ones: call the elevation y, and the two variables x1 and x2.
y = f(x1, x2)
We can calculate the rate of change in elevation with respect to x1, and the rate of change with respect to x2.
These are called partial derivatives because we only consider the change with respect to one variable at a time. Notice also that we use a different symbol, ∂ instead of d, to indicate that these are partial derivatives: ∂y/∂x1 and ∂y/∂x2.
In the two-dimensional plane of x1 and x2, the direction of the most abrupt change will be a two-dimensional vector whose components are the partial derivatives with respect to each variable.
We call this vector the gradient, and we indicate it with an inverted triangle, ∇, which is also called del or nabla.
The gradient is an operation that takes a function of multiple variables and returns a vector. The components of this vector are all the partial derivatives of the function. Since the partial derivatives are functions of all variables, the gradient, too, is a function of all variables.
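As a sketch of this idea, the snippet below builds the gradient of a toy two-variable function by computing one finite-difference partial derivative per variable. The function f and the evaluation point are made up for illustration.

```python
import numpy as np

def f(x):
    # A toy "elevation" as a function of x = (x1, x2).
    return x[0] ** 2 + 3 * x[1] ** 2

def gradient(f, x, eps=1e-6):
    # One partial derivative per variable, stacked into a vector.
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

print(gradient(f, np.array([1.0, 2.0])))  # ~[2.0, 12.0]
```

Evaluated at a different point, the same call returns a different vector, which is the sense in which the gradient is itself a function of all the variables.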
Back-propagation Intuition:
Let's say we have a function of only one variable called w. For every value on the horizontal axis, the function associates a value on the vertical axis. Let's say we're sitting at a particular point like in the figure.
Let's also assume that we do not know the function f(w) at every possible point; we only know it near where we are. We want to move in the direction of decreasing f(w), but we can only use local information. How do we decide where to go? As we've seen when we talked about descending from a hill, the derivative indicates the slope at each point. So we can calculate the derivative where we are, and then change our position by subtracting the value of the derivative from the value of our starting position w. In other words, we can take one step, following the rule
w_new = w - df/dw
In the figure, the slope at our position is downhill, so the derivative df/dw is negative and -df/dw is positive, so the value of w will increase. We have to move towards the right on the horizontal axis, the corresponding value on the vertical axis will decrease, and we have successfully moved towards a lower value of the function f(w).
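Here is a minimal sketch of this one-variable descent in Python. The function f(w) is an arbitrary example with a known minimum, and the step is scaled by a small factor (anticipating the learning rate discussed below) so the steps stay small.

```python
def df_dw(w):
    # Derivative of f(w) = (w - 3)**2, whose minimum is at w = 3.
    return 2 * (w - 3.0)

w = 0.0  # our starting position
for _ in range(50):
    w = w - 0.1 * df_dw(w)  # step against the derivative

print(w)  # close to 3.0, the minimum
```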
This way of looking for the minimum of a function is called gradient descent, and it's the idea behind back-propagation. Given a function, we can always move towards its minimum by stepping against its derivative or, in the case of multiple variables, against the gradient. As you know by now, for a neural network we define a cost function that depends on the values of the parameters, and, as you also know, we find the values of the parameters by minimizing the cost with gradient descent. All we are really doing is taking the cost function, calculating its partial derivatives with respect to each parameter, and then using the update rule we just described to decrease the cost by updating the parameters. We do this by subtracting the value of the gradient from each of the parameters, which is the same as moving along the negative gradient.
This is what's called a parameter update.
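As an illustrative sketch, the same update applied to a whole vector of parameters looks like this; the toy cost and its gradient are stand-ins for a real network's cost function and back-propagated gradients.

```python
import numpy as np

def cost_gradient(params):
    # Gradient of the toy cost sum(params**2): one partial
    # derivative per parameter; the cost is minimized at the zero vector.
    return 2 * params

params = np.array([4.0, -2.0, 1.0])
learning_rate = 0.1
for _ in range(100):
    params = params - learning_rate * cost_gradient(params)  # parameter update

print(params)  # all entries near 0
```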
Learning Rate:
As noted, the gradient vector has both a direction and a magnitude. Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point. For example, if the gradient magnitude is 2.5 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.025 away from the previous point.
Hyperparameters are the knobs that programmers tweak in machine learning algorithms. Most machine learning programmers spend a fair amount of time tuning the learning rate. If you pick a learning rate that is too small, learning will take too long:
Learning rate is too small.
Conversely, if you specify a learning rate that is too large, the next point will perpetually bounce haphazardly across the bottom of the well like a quantum mechanics experiment gone horribly wrong:
Learning rate is too large.
There's a Goldilocks learning rate for every regression problem. The Goldilocks value is related to how flat the loss function is. If you know the gradient of the loss function is small then you can safely try a larger learning rate, which compensates for the small gradient and results in a larger step size.
Learning rate is just right.
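A quick way to see all three regimes is to run the same descent on a toy loss f(w) = w**2 with different learning rates; the specific values below are illustrative.

```python
def df_dw(w):
    return 2 * w  # derivative of f(w) = w**2, minimum at w = 0

def descend(learning_rate, steps=20, w=1.0):
    for _ in range(steps):
        w = w - learning_rate * df_dw(w)
    return w

print(descend(0.001))  # too small: after 20 steps, still close to 1.0
print(descend(1.1))    # too large: each step overshoots and |w| grows
print(descend(0.3))    # just right: converges to almost exactly 0
```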
Reducing Loss: Stochastic Gradient Descent
In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration. So far, we've assumed that the batch has been the entire data set. In practice, though, data sets often contain billions or even hundreds of billions of examples. Consequently, a batch can be enormous. A very large batch may cause even a single iteration to take a very long time to compute.
A large data set with randomly sampled examples probably contains redundant data. In fact, redundancy becomes more likely as the batch size grows. Some redundancy can be useful to smooth out noisy gradients, but enormous batches tend not to carry much more predictive value than large batches.
What if we could get the right gradient on average for much less computation? By choosing examples at random from our data set, we could estimate (albeit noisily) a big average from a much smaller one. Stochastic gradient descent (SGD) takes this idea to the extreme: it uses only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works, but it is very noisy. The term "stochastic" indicates that the one example comprising each batch is chosen at random.
Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch.
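Here is a hedged sketch of mini-batch SGD fitting a one-variable linear model with mean squared error; the data, model, and hyperparameter values are invented for illustration. Setting batch_size to 1 would give plain SGD, and setting it to len(X) would give full-batch gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus a little noise.
X = rng.uniform(-1, 1, size=10_000)
y = 2 * X + 1 + 0.1 * rng.standard_normal(10_000)

w, b = 0.0, 0.0
learning_rate, batch_size = 0.1, 32

for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)  # random mini-batch
    xb, yb = X[idx], y[idx]
    err = (w * xb + b) - yb                     # prediction errors on the mini-batch
    w -= learning_rate * 2 * np.mean(err * xb)  # MSE gradient w.r.t. w
    b -= learning_rate * 2 * np.mean(err)       # MSE gradient w.r.t. b

print(w, b)  # roughly 2 and 1
```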