Why is Deep Learning taking off?

Week 1, Lecture 04

Author: Andrew Ng. Tufte notebooks: Alfonso R. Reyes

2017-12-01


Introduction

If the basic technical ideas behind deep learning, behind neural networks, have been around for decades, why are they only just now taking off?

In this video let’s go over some of the main drivers behind the rise of deep learning, because I think this will help you spot the best opportunities within your own organization to apply them.

Performance vs Data

Over the last few years a lot of people have asked me, “Andrew, why is deep learning suddenly working so well?” And when I am asked that question, this is usually the picture I draw for them [plot: performance vs. amount of data].

Let’s say we plot a figure where on the horizontal axis we plot the amount of data we have for a task, and on the vertical axis we plot the performance of our learning algorithm, such as the accuracy of our spam classifier or ad click predictor, or the accuracy of our neural net for figuring out the position of other cars for our self-driving car.

It turns out that if you plot the performance of a traditional learning algorithm, like a support vector machine or logistic regression, as a function of the amount of data you have, you might get a curve that looks like this [red curve: traditional learning algorithm].

The performance improves for a while as you add more data, but after a while it pretty much plateaus into roughly a horizontal line. As I said, the traditional algorithms didn’t know what to do with huge amounts of data.

Amounts of Data

What happened in our society over the last 20 years or so is that for a lot of problems we went from having a relatively small amount of data to often having a fairly large amount of data. All of this was thanks to the digitization of society, where so much human activity now takes place in the digital realm. We spend so much time on computers, on websites, and on mobile apps, and activity on digital devices creates data. And thanks to the rise of inexpensive cameras built into our cell phones, accelerometers, and all sorts of sensors in the Internet of Things, we have also just been collecting more and more data.

So, over the last 20 years, for a lot of applications we simply accumulated a lot more data, more than traditional learning algorithms were able to effectively take advantage of. And with neural networks it turns out that if you train a small neural net, the performance may look like that [yellow curve: small NN].

If you train a somewhat larger neural net, call it a medium-sized neural net [blue curve: medium NN], the performance is often a little bit better, and if you train a very large neural net [green curve: large NN], the performance often just keeps getting better and better.
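To make the shape of this figure concrete, here is a minimal matplotlib sketch (not from the lecture; the curve shapes and constants are invented purely for illustration) that draws a plateauing traditional-algorithm curve alongside small, medium, and large neural network curves whose performance keeps improving with more data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Amount of labeled data m (arbitrary units) on the x-axis.
m = np.linspace(0, 10, 200)

# Saturating curves of the form plateau * (1 - exp(-m / rate)):
# a larger plateau means the algorithm can benefit from more data.
def perf(plateau, rate):
    return plateau * (1 - np.exp(-m / rate))

plt.plot(m, perf(1.0, 1.0), "r", label="traditional algorithm")
plt.plot(m, perf(1.5, 2.0), "y", label="small NN")
plt.plot(m, perf(2.0, 3.0), "b", label="medium NN")
plt.plot(m, perf(3.0, 4.0), "g", label="large NN")
plt.xlabel("amount of labeled data (m)")
plt.ylabel("performance")
plt.legend()
plt.show()
```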

So a couple of observations. One is that if you want to hit this very high level of performance, then you need two things: a big enough neural network to take advantage of the huge amount of data, and a lot of data itself.

So we often say that scale has been driving deep learning progress. And by scale I mean both the size of the neural network, meaning a neural network with a lot of hidden units, a lot of parameters, and a lot of connections, as well as the scale of the data.

In fact, today one of the most reliable ways to get better performance from a neural network is often to either train a bigger network or throw more data at it. That only works up to a point, because eventually you run out of data, or eventually your network is so big that it takes too long to train, but just improving scale has actually taken us a long way in the world of deep learning.

To make this diagram a bit more technically precise, let me add a few more things.

I wrote the amount of data on the x-axis. Technically, this is the amount of labeled data, where by labeled data I mean training examples for which we have both the input \(x\) and the label \(y\). Let me also introduce a little bit of notation that we’ll use later in this course: we’re going to use lowercase \(m\) to denote the size of the training set, that is, the number of training examples. So this \(m\) is the horizontal axis.
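As a tiny, hypothetical illustration of this notation (not part of the lecture), a labeled training set is just a collection of \((x, y)\) pairs, and \(m\) is simply how many of them you have:

```python
# Hypothetical labeled training set: each example pairs an input x with a label y.
training_set = [
    ([0.5, 1.2], 1),  # (x, y) for example 1
    ([2.3, 0.7], 0),  # (x, y) for example 2
    ([1.1, 3.4], 1),  # (x, y) for example 3
]

m = len(training_set)  # m = number of training examples (here, 3)
print(m)
```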

A couple of other details about this figure. In the regime of smaller training sets, the relative ordering of the algorithms is actually not very well defined. If you don’t have a lot of training data, it is often your skill at hand-engineering features that determines performance.

So it’s quite possible that if someone training an SVM is more motivated to hand-engineer features than someone training even a large neural net, then in this small-training-set regime the SVM could do better.

So in this region to the left of the figure, the relative ordering between the algorithms is not that well defined, and performance depends much more on your skill at hand-engineering features and on other details of the algorithms. It is only in the big-data regime, the very large training set, very large \(m\) regime on the right, that we more consistently see large neural nets dominating the other approaches.

And so if any of your friends ask you why neural networks are taking off, I would encourage you to draw this picture for them as well.

Improved Algorithms

So I will say that in the early days of the modern rise of deep learning, it was scale of data and scale of computation, just our ability to train very large neural networks either on a CPU or a GPU, that enabled us to make a lot of progress. But increasingly, especially in the last several years, we’ve seen tremendous algorithmic innovation as well, so I also don’t want to understate that.

Interestingly, many of the algorithmic innovations have been about trying to make neural networks run much faster.

So, as a concrete example, one of the huge breakthroughs in neural networks has been switching from a sigmoid function, which looks like this [sigmoid function plot], to a ReLU function, which we talked about briefly in an earlier video, and which looks like this [ReLU function plot].

If you don’t understand the details of what I’m about to say, don’t worry about it. But it turns out that one of the problems of using sigmoid functions in machine learning is that there are these regions [the asymptotes of the sigmoid] where the slope of the function, the gradient, is nearly zero, so learning becomes really slow: when you implement gradient descent and the gradient is nearly zero, the parameters change very slowly. Whereas by changing the activation function of the neural network to the ReLU function, the rectified linear unit, the gradient is equal to one for all positive values of the input, so the gradient is much less likely to gradually shrink to zero. The slope of this line is zero on the left, but it turns out that just by switching from the sigmoid function to the ReLU function, gradient descent works much faster. So this is an example of a relatively simple algorithmic innovation, but ultimately its impact was that it really helped computation.
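As a minimal NumPy sketch (not from the lecture) of this point, the code below compares the gradients of the two activation functions at a few input values; the sigmoid gradient is nearly zero far from the origin, while the ReLU gradient stays at one for positive inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # nearly 0 when |z| is large

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)  # 1 for positive inputs, 0 otherwise

z = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(z))  # roughly [0.000045, 0.197, 0.235, 0.000045]: learning slows at the tails
print(relu_grad(z))     # [0, 0, 1, 1]: the gradient does not shrink for positive z
```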

So there are quite a lot of examples like this, where we changed the algorithm because it allows the code to run much faster, and this allows us to train bigger neural networks, or to train them in a reasonable amount of time, even when we have a large network and a lot of data.

Faster Computation

The other reason that fast computation is important is that the process of training a neural network is very iterative. Often you have an idea for a neural network architecture, so you implement your idea in code. Implementing your idea lets you run an experiment, which tells you how well your neural network does; then, by looking at the result, you go back and change the details of your neural network, and you go around this cycle over and over.

And when your neural network takes a long time to train, it just takes a long time to go around this cycle. There is a huge difference in your productivity building effective neural networks when you can have an idea, try it, and see whether it works in ten minutes, or maybe at most a day, versus having to train your neural network for a month, which sometimes does happen. Because if you get a result back in ten minutes, or maybe in a day, you can try a lot more ideas and be much more likely to discover what works well for your application.

So faster computation has really helped speed up the rate at which you can get an experimental result back, and this has helped both practitioners of neural networks and researchers working in deep learning iterate much faster and improve their ideas much faster.

Closing words

And so all of this has also been a huge boon to the entire deep learning research community, which has been incredible at inventing new algorithms and making nonstop progress on that front.

So these are some of the forces powering the rise of deep learning, but the good news is that these forces are still working powerfully to make deep learning even better.

Take data: society is still producing more and more digital data.

Or take computation: with the rise of specialized hardware like GPUs, faster networking, and many other types of hardware, I’m actually quite confident that our ability to build very large neural networks, from a sheer computation point of view, will keep on getting better.

And take algorithms: the deep learning research community continues to do phenomenal work innovating on the algorithms front.

So, because of this, I think we can be optimistic that deep learning will keep on getting better for many years to come.

With that, let’s go on to the last video of this section, where we’ll talk a little bit more about what you will learn in this course.