
Machine Learning, Data Science and Deep Learning with Python

Complete hands-on machine learning tutorial with data science, Tensorflow, artificial intelligence, and neural networks

14:16:26 of on-demand video • Updated April 2021

A free video tutorial from 

Sundog Education by Frank Kane

Founder, Sundog Education. Machine Learning Pro

Course summary

  • Build artificial neural networks with Tensorflow and Keras
  • Classify images, data, and sentiments using deep learning
  • Make predictions using linear regression, polynomial regression, and multivariate regression
  • Visualize data with Matplotlib and Seaborn
  • Implement machine learning at massive scale with Apache Spark's MLlib
  • Understand reinforcement learning - and how to build a Pac-Man bot
  • Classify data using K-Means clustering, Support Vector Machines (SVM), KNN, Decision Trees, Naive Bayes, and PCA
  • Use train/test and K-Fold cross validation to choose and tune your models
  • Build a movie recommender system using item-based and user-based collaborative filtering
  • Clean your input data to remove outliers
  • Design and evaluate A/B tests using T-Tests and P-Values

Lesson transcript

So let's make this real! Let's look at some actual Spark code that builds a decision tree using MLlib, one that could actually scale up to a cluster if we wanted it to. It's actually pretty simple. Let's take a look.

So let's play around with Spark and MLlib. Open up your Anaconda prompt or your terminal, depending on your operating system. And by the way, if you did just install Spark, remember we set some environment variables, so if you already have an Anaconda prompt open, you'll need to close and reopen it to pick those up.

All right. So let's cd into our course materials folder as we always do, and in here there are a couple of Python scripts that we can use with Spark. Now, unlike before, we can't actually run this in a notebook, so instead we're just going to use whatever text editor we have to look at these files and walk through what they're doing. We may as well use Spyder, the Python editor that comes with Anaconda. So just go ahead and type in spyder, with a y, and wait for that to come up.

And here we are. So go ahead and hit the open icon, navigate to your course materials folder (mlcourse), and open the script we want. All right, so let's walk through what's going on here, shall we? Now again, we're not using an IPython notebook this time around; we're just using a standalone Python script, hence the .py extension instead of .ipynb. It is actually possible to run Spark code within a notebook, but it involves even more setup steps, and I think we've done enough of that just for running a couple of Spark examples here. So let's keep this as a standalone script. In the real world, the way you would typically run this on a cluster is to copy the script to the master node of that cluster; there's a script called spark-submit that comes with Spark which will interpret your script and distribute it throughout the rest of the cluster for you. So that's really the way you would want to do it in the real world anyway.
It's possible to kick it off from a notebook, but it's just a little bit more trouble than I want to deal with right now. Anyway, let's walk through what this script is doing. Simple enough, but this may be new to you, so I'll go through it a little bit slowly.

We start by importing all the packages we need, of course. We need some things from MLlib, obviously, if we're going to be writing MLlib code: something called a LabeledPoint, and the DecisionTree itself, both of which we talked about earlier. Pretty much every Spark script is going to import SparkConf and SparkContext as well. We're also going to import array from NumPy, which allows us to use NumPy arrays as we're manipulating and preparing our data here.

Now, keep in mind that Spark is not going to magically make everything from NumPy and scikit-learn distributed and parallelized across a cluster. If you call NumPy or scikit-learn functions within the script, they will just run on the specific node this script is running on; that work will not automatically be distributed across your cluster for you. You have to use the actual functions within MLlib for that to happen. So keep that in mind: yes, you can still use NumPy and scikit-learn in here, but those methods will not be distributed. If you want distributed machine learning, you've got to stick with what's in MLlib.

All right. So to kick off a Spark script, first we need to set up a SparkContext, which is the environment we're running Spark within. It takes care of all the niggly details of how to actually distribute this work, and how to organize the order in which things are run and assembled back together across your cluster. The beauty of Spark is that it does all that thinking for you; you don't have to worry about that part of it. To set up a SparkContext, however, we need a configuration object first.
And what's going on here is we're setting up a new SparkConf object. setMaster("local") means we're only going to run it on our local PC for this example, because I don't have a cluster handy; if you were running on a real cluster, you would change that to something else. We're also going to set an application name, so that if you viewed this on the Spark console (if we had one running) you would see it referred to by that name. With that, we set up our SparkContext.

We'll skip these functions for now and come back to them when we actually call them. If you go down below these functions, you start to get to the lines of code that will actually be executed. We start off by loading up our raw data from the pasthires.csv file. We saw this earlier in our decision tree example; let's go ahead and open it up to refresh ourselves on what it looks like. So if you go to our course materials, we should find pasthires.csv in there. Let's open that up, and it will open in Excel for me, so it's going to look like a pretty table even though it is just a comma-separated-value file.

So again, the structure here is that the first line contains the headings for the columns: years of experience, whether or not they're employed, previous number of employers, level of education, and so on and so forth. Beyond that we have a lot of data here that needs to be converted into numerical form; like any machine learning algorithm, this one deals better with numbers than with letters. So we're going to have to transform these Ys and Ns into ones and zeros, and these BS, MS, and PhD levels of education will need to be converted into numerical, ordinal data instead. So that's what we're dealing with here. Let's go back to our script.

All right. So the first thing we need to do is strip that header row off, because it's not actually useful information for the algorithm, right?
To do that, the trick we're using is this: we say header = rawData.first(). What happened when we called sc.textFile() is that it loaded every individual row of that CSV file into an RDD called rawData. OK, so now we have an RDD called rawData that contains nothing but the raw comma-separated string for each row of the data. What we're doing here is extracting the first row, which is going to be our header row containing the names of the columns. Then we call the filter function on our rawData RDD with a lambda function. Again, this is basically an inline function; it says that as long as the given row does not equal the header row, we'll preserve it. So by doing this we make a copy of rawData with that first header row filtered out, and we save that into a new rawData. So at this point we have a rawData RDD where the header row has been filtered out.

Now, this is as good a time as any to mention that in modern Spark code there's something called a dataset instead of an RDD, and that tends to be used more widely these days because it has slightly better performance in some instances; well, a lot better performance in some cases, depending on how you're using it. It also lets you execute SQL against the data right in place. Because of those conveniences, people are migrating toward using datasets instead of RDDs; a dataset is basically a higher-level structure. But in this case it doesn't really make a big difference, so we can use RDDs, and MLlib will work basically the same way with them. So we're going to stick with RDDs for now. My way of looking at it is: if you have a simple solution and a more complicated solution, and there's no big performance difference, stick with the simple solution.
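To make that header-stripping step concrete, here's a minimal plain-Python sketch of the same pattern. A list stands in for the RDD, and the two sample data rows are made up for illustration; only the column headings come from the lesson:

```python
# Plain-Python stand-in for the RDD pattern described above:
#   header = rawData.first()
#   rawData = rawData.filter(lambda line: line != header)
raw_data = [
    "Years Experience,Employed?,Previous employers,Level of Education,Top-tier school,Interned,Hired",
    "10,Y,4,BS,N,N,Y",   # made-up sample row
    "0,N,0,BS,Y,Y,Y",    # made-up sample row
]

header = raw_data[0]                                  # rawData.first()
rows = [line for line in raw_data if line != header]  # the filter() lambda

print(rows)  # every line except the header survives
```

The lambda keeps every line that is not byte-for-byte equal to the header string, which is exactly why the real script grabs the header with first() before filtering.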
So I'm going to stick with RDDs here, but just so you know, when you talk to people today about Spark, they're probably going to be working with datasets or data frames instead of RDDs; it's the same general concept, just with more functionality.

All right. So now we need to actually split our comma-separated values into individual fields, and to do that we're going to call a map function with another little inline lambda function that calls split on each line using the comma. That takes each row of data and splits it up, based on the commas, into individual fields in a list. The result is a new RDD called csvData, where we've actually structured that data somewhat: we've taken out the commas, so instead of one value containing a big comma-separated list of stuff, we have a row that contains the individual fields we're interested in.

Now we need to convert those fields to what we want, so we'll call map with an actual named function at this point, called createLabeledPoints. Let's move up to that function and see what it does. All right. So createLabeledPoints takes in the list of fields that came from our csvData after splitting on the commas, and it converts them into the format we actually need for training our decision tree. The first thing we do is convert the first field, which represents the years of experience, from a string into an integer. Then we take the employed field and call our binary function on it. fields[1] is a field that contains either the letter Y or N, indicating whether or not they're currently employed; the binary function just says if it's a Y, return 1, else return 0. So this function is called on each row to convert that Y to a 1 or N to a 0. OK? Remember, machine learning generally wants numbers, not strings or letters, whenever possible. We'll also convert the previous number of employers from a string to an integer.
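The split-on-commas map and the binary() helper just described can be sketched in plain Python; a list comprehension stands in for the RDD's map, and the sample row is made up for illustration:

```python
def binary(yn):
    """Convert a Y/N string to 1/0, as the lesson's binary() helper does."""
    return 1 if yn == "Y" else 0

# The map(lambda line: line.split(",")) step, on a plain list:
raw_rows = ["10,Y,4,BS,N,N,Y"]           # one made-up comma-separated row
csv_data = [line.split(",") for line in raw_rows]

fields = csv_data[0]
years_experience = int(fields[0])   # string -> int
employed = binary(fields[1])        # "Y" -> 1, "N" -> 0

print(years_experience, employed)   # -> 10 1
```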
For the education level, we'll call this mapEducation function on that field, which just converts BS, MS, and PhD to the ordinal values 1, 2, and 3. And we'll call the binary function again to convert the Ys and Ns on whether they came from a top-tier school and whether they had a previous internship, plus the final label data of whether they were hired or not, from Ys and Ns to ones and zeros.

As you may recall, MLlib wants labeled points as its input, so we're going to return a LabeledPoint structure that contains the label, which is the hired field, followed by all the feature data: an array containing the years of experience, whether or not they're employed, previous employers, and so on and so forth. So the LabeledPoint contains the label, which is the thing we're trying to predict (whether or not they should be hired), and then the features, which are all the different attributes of each person that might influence whether or not they would be hired.

All right. So at this point we go back down to where this was called. We have a new RDD called trainingData that contains all of our training data converted into numerical form and ultimately into LabeledPoints, which is what MLlib expects. So, awesome, now we can start playing with MLlib. Let's create a set of test candidates to try this out with; for this example we'll just set up one person. We're going to set up an array containing information that represents 10 years of prior experience; they are currently employed; they had three previous employers; they currently have a BS degree; they are not from a top-tier school; and they did not do an internship. OK, so we've set up this fake test candidate to see if we can actually make a prediction for this new person we haven't seen before.
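Here's a hedged plain-Python sketch of the field conversion just described. A (label, features) tuple stands in for MLlib's LabeledPoint, the exact mapEducation spellings are an assumption, and the feature values are the ones narrated in the lesson (10 years of experience, employed, 3 previous employers, a BS, not top-tier, no internship); the final "Y" hired label here is made up, since the test candidate's outcome is what we're trying to predict:

```python
def binary(yn):
    """'Y' -> 1, 'N' -> 0, as in the lesson's binary() helper."""
    return 1 if yn == "Y" else 0

def map_education(degree):
    """BS, MS, PhD -> ordinal 1, 2, 3 (spellings assumed; unknown -> 0)."""
    return {"BS": 1, "MS": 2, "PhD": 3}.get(degree, 0)

def create_labeled_point(fields):
    """Plain-Python stand-in for createLabeledPoints: (label, [features])."""
    hired = binary(fields[6])
    features = [
        int(fields[0]),             # years of experience
        binary(fields[1]),          # employed?
        int(fields[2]),             # previous employers
        map_education(fields[3]),   # education level, as an ordinal
        binary(fields[4]),          # top-tier school?
        binary(fields[5]),          # internship?
    ]
    return (hired, features)

label, features = create_labeled_point(["10", "Y", "3", "BS", "N", "N", "Y"])
print(label, features)   # features -> [10, 1, 3, 1, 0, 0]
```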
Once our decision tree has been created, we take that test candidate and create an RDD out of it, so we can feed it into Spark, using the parallelize function. That just converts this array of test candidates (which is really just one candidate) into an RDD called testData.

Next we actually build our decision tree classifier; we'll call it model. We just call DecisionTree.trainClassifier, which comes from the MLlib library, passing in our trainingData RDD that contains all the labeled training data, plus a bunch of hyperparameters. numClasses indicates that we only have two classes we're trying to sort people into: hired or not, yes or no, is two different classes. We also have to pass in a structure defining which of our features are categorical in nature, and then we can specify how the decision tree itself is constructed: what impurity function it uses, its maximum depth, and its maximum number of bins.

All right. Once we have that model trained, we can actually use it to make predictions. So we'll just call model.predict, given our testData RDD that contains our test candidate, and we'll print out the actual result of that prediction. And here's the important point: at this stage we're saying "I want to call predictions.collect(); I want to get something back from Spark that gives me an answer." It's not until this point that Spark actually does anything. All that's been happening up to now is that a directed acyclic graph has been constructed of all the steps Spark needs to perform to produce an answer at large scale. Once I actually say I want a result, Spark will go back and work out the optimal way of putting it all together, and the optimal way of distributing it if I were on a cluster, and at that point it will go off and start chugging away and produce an answer for me.
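That "nothing happens until you collect()" behavior has a rough plain-Python analogue: map objects in Python are also lazy, and no work is done until you materialize them, much as Spark defers execution until an action like collect() forces the DAG to run. This is only an analogy (Spark additionally optimizes and distributes the deferred work), but it shows the shape of the idea:

```python
log = []

def expensive(x):
    log.append(x)   # record when the work actually happens
    return x * x

numbers = [1, 2, 3]
squared = map(expensive, numbers)   # like a transformation: nothing runs yet
assert log == []                    # no work has been done so far

result = list(squared)              # like collect(): NOW the work runs
print(result, log)                  # -> [1, 4, 9] [1, 2, 3]
```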
So we'll print out our ultimate hired prediction, and we'll also print out the model itself. There's a handy toDebugString method on the decision tree model that lets us understand what's going on inside the decision tree and what decisions it's making based on what criteria.

So with that, we can try it. Now again, with Spark we need to run this within the Spark environment itself; I can't just run it from within Spyder, at least not without a bunch of extra setup steps. So let's close out of Spyder, or at least minimize it for now, and go back to our Anaconda prompt; let's actually open up a new one. This will make sure that we have Anaconda's Python environment available to us. Again, we'll cd to our course materials, and now what we can do is type in spark-submit followed by the script name (SparkDecision). This spark-submit script is part of Spark itself; it's what actually takes a script, decides how to distribute it, and feeds it into the Spark engine. Let's hit enter and see what happens. If you installed Spark successfully, you should be seeing something like this, and there we have it.

All right. So for our test user, we actually predicted that we would hire that person. And we also have the actual decision tree itself printed out here. Now, we obviously can't do a nice, pretty graphical representation like we did before, because we're just in the command console here, but you can still interpret this. Basically it says "if feature 1 in {0}". The way to interpret that: if we look back at our source data and start counting at zero, feature 1 would be "employed". And remember, we converted Y and N to 1 and 0. So it basically says: if you're not employed, that is, if feature 1 (employed) is in the set {0}, which contains the single value 0.
So for categorical data you'll see that syntax: "in", then curly brackets containing whatever the categories are. All right. So if you're not employed, and if feature 5 is also 0 (count 0, 1, 2, 3, 4, 5; that's the internship field), so if you're unemployed and did not do an internship, and this says you have less than half a year of experience (basically, no experience) and only a Bachelor of Science degree, then the prediction is: we will not hire you. You can go through and figure out the rest of the structure if you want, but that's basically how you read this stuff.

Cool, so there you have it: an actual decision tree running within Apache Spark. And although that seems like sort of a convoluted way of doing things on a single computer, the beauty is that if you were to run this on the master node of a real Hadoop cluster or a real Spark cluster, it would just work; it would actually distribute that work across the entire cluster. How cool is that? You could feed in a massive set of training data and a massive set of people you want to make predictions for, and it could distribute that throughout an entire cluster and give you back results no matter how big the dataset might be. That's what's really exciting about this. You could imagine a world where you're working for some huge company, or some company that produces hiring or recruiting software, and you could run this at massive scale across a massive number of people.

I'll leave aside the ethical concerns of actually doing something like this in the real world, where you're trying to boil people down into a number and feed them into a model. Obviously I wouldn't want real hiring decisions to be based on that alone; that would not be a world I want to live in. But for the sake of illustration, this is how it would work. So there you have it: decision trees in Spark, running for real.
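The branch just walked through ("if feature 1 in {0}", and so on) can be checked mechanically. Here's a hedged sketch that encodes only that one path of the printed tree; the comparisons come from the lesson's narration of the output, the exact thresholds of the real learned tree may differ, and any candidate who doesn't match this path is simply not on this branch:

```python
# Feature indices, counting from zero as in the lesson:
# 0: years of experience, 1: employed, 3: education (1 = BS), 5: internship

def follows_no_hire_path(features):
    """True if a candidate matches the 'do not hire' branch read off above:
    not employed, no internship, under half a year of experience, only a BS."""
    return (
        features[1] in {0}        # "if feature 1 in {0}": not employed
        and features[5] in {0}    # no internship
        and features[0] <= 0.5    # less than half a year of experience
        and features[3] == 1      # education level is BS (assumed exact split)
    )

unemployed_newgrad = [0, 0, 0, 1, 0, 0]   # made-up candidate on this branch
experienced_hire = [10, 1, 3, 1, 0, 0]    # the lesson's test candidate

print(follows_no_hire_path(unemployed_newgrad))  # -> True
print(follows_no_hire_path(experienced_hire))    # -> False
```

The test candidate is employed (feature 1 is 1), so they fall off this branch at the very first split, consistent with the "hire" prediction the script printed.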
And there you have it: an actual decision tree built using Spark and MLlib that actually works and actually makes sense. Pretty awesome stuff. So you can see it's pretty easy to do, and you can scale it up to as large a dataset as you can imagine, if you have a large enough cluster. So there you have it.
