Data Science Webinars
Rishabh Gupta is a Senior Data Scientist at Jetstar, where he helps the business make data-driven decisions by transforming data into meaningful stories and actionable insights.
Rishabh has over 8 years of experience working in companies like Accenture, Altisource, and General Electric, where he led and implemented multiple analytics projects.
Rishabh holds a bachelor’s degree in Electronics Engineering from Uttar Pradesh Technical University and has completed the Executive Programme in Business Analytics at the Indian School of Business.
When he isn’t working, you’ll find Rishabh playing PUBG.
Key takeaways for you
- Understand the intuition behind support vectors and SVM
- Learn applications of SVM in ML
- Understand the workings of SVM
- Learn how to implement SVM in Python
We’ll start off. Today’s lecture is on classification using support vector machines, and the session is taken by Rishabh Gupta. He is from Melbourne, Australia, where he is a Senior Data Scientist at Jetstar and helps the business make data-driven decisions by transforming data into meaningful stories and actionable insights.
Now, Rishabh has over eight years of experience working in companies like Accenture, Altisource, and General Electric, where he led and implemented many analytics projects.
Rishabh also holds a bachelor’s degree in electrical engineering from Uttar Pradesh Technical University and has completed an executive program in business analytics at the Indian School of Business, which is one of the dream colleges for a lot of students. Also, this is for all the PUBG players: when Rishabh is not busy working, he enjoys playing PUBG, so I think you guys can catch up after the session for such things. Great.
Hello, everyone. Can everyone hear me?
Okay. So good morning, and good afternoon to people, depending on your time zones. So today, we’ll be covering support vector machines.
So here’s the agenda for today’s session. I’ll be going over the SVM basics, and then I’ll move on to the SVM objective function. After that I’ll look at slack variables, the kernel trick, and hyperparameters, and then we’ll have the coding session.
In the SVM objective function part, I will be covering the maths only a little bit. I won’t be going into detail because of the time constraint; I’ll just be giving you guys an overview of it. Okay, so let’s get started. I’ll begin with the very basics, the definitions of AI and machine learning: what is artificial intelligence, and what is machine learning? In the past, I’ve been using these terms incorrectly myself. The actual definition of machine learning is a bunch of computer algorithms that are trying to uncover insights or predict some trend. Artificial intelligence is a system that is enabled by machine learning to perform tasks that require human intelligence. A very simple example would be the spam filter that we use in Gmail. Whenever an email comes into our Gmail, it reads the data and tries to classify whether it’s spam or not spam. So this is an example of an artificially intelligent machine. Basically, we use machine learning methods to create these AI systems.
Okay, I just got a comment from someone that I’m not loud enough, so I’ll be a little louder. Is it better? Can everyone hear me? Okay, awesome. Moving on to the next slide.
So I’ll be covering the machine learning paradigms. There are many machine learning paradigms; the two major ones are unsupervised learning and supervised learning. The unsupervised learning paradigm is about finding the structure in the data. The way you do it is by using clustering or PCA, principal component analysis. All these techniques are used for finding structure in the data, so we don’t have any labeled data set in unsupervised learning.
Now, the most widely used machine learning paradigm is supervised learning, where we try to find a mapping between the features and the label based on a labeled data set. There are two types of supervised learning: one is regression and the other is classification. In regression, we are trying to predict a real-valued output, whereas in classification, we are trying to predict a category. To give you guys an example, if you are trying to predict the weight of a person, then it’s going to be a regression problem, because the output can take any number of values. A classification problem, on the other hand, would be a yes/no kind of problem; that is a binary classification problem.
Okay, moving on to the next slide. I’ll cover how the journey from data to decision takes place. We have some data; how do we make decisions based on that data? We start by finding insights in the data using unsupervised learning techniques like clustering and PCA, or by doing some kind of analysis with scatter plots, histograms, bar charts, or whatever. The idea is to understand the data and get insights out of it. Once we have the insights, we use them to come up with features. Another way of coming up with features is domain knowledge. Based on these two factors, we come up with the features which we feed into our model. Now we try out lots of different models, and we compare all these models based on a specific metric: it could be accuracy, or it could be any other metric. Based on that metric, which you’re trying to optimize for, we come up with the best model, which we use for prediction. Once this prediction is made, we try to optimize our business objective while honoring our business constraints to make our decisions.
Once we make our decision, we will have historical data on whether the decision was indeed a right decision or a wrong one, and we feed this information back into our data, which is again used in our model to improve its accuracy even further. So this is the entire journey from data to decisions. Okay, so guys, in case anyone has any questions, please type them into the chat box. I’ll try to answer them right away; in case I’m not able to, I’ll try to address them at the end of the session.
Okay, moving on to the next slide: support vector machines. The support vector machine is a supervised machine learning algorithm which can be used for both regression and classification, and it works really well with small data sets. In today’s webinar, I’ll just be going over the classification part; I will not be covering regression. Now, think of this classification problem: we have blue circles and red circles, and we are trying to come up with a line that best separates these two classes. If you think about it, we can come up with many lines: this could be one line, this could be another line. All of these lines split the data perfectly. But let me ask you guys this question: which line is the best one for separating these two classes? If you think intuitively, this would be the line which separates these two classes in the best possible way. Think of this line as the median of a road; we are trying to build a road as wide as possible so that it separates these two classes in the best possible way.
Now, this is the end goal of SVMs: to come up with the line which separates these two classes in the best possible way. As for why this other line is not the best line: it is not separating the two classes in the best possible way. If you try to build a road using this line as the median, the width of this road is not as broad as the one that we saw in the previous slide. I just got a question. Okay. Now think of this other line that we saw. Again, for this case as well, the road is not as broad as what you’ve seen in the previous slide. So this is the road that we are trying to build, which separates the two classes in the best possible way: this is the widest road. Now, instead of calling it a road, we can call it a margin, which is the more technical term. So we are trying to get the widest margin that separates these two groups. In other words, we are trying to maximize the distance between these boundary points, and in this way we make sure that the margin is as wide as possible. The points which are at the boundary are known as support vectors, and they are helping us to build that road.
So to summarize what SVM is trying to do: it’s trying to come up with that perfect line, or plane, or hyperplane, which separates these two classes in the best possible way by building the widest road between the two groups with the help of support vectors. Okay, another thing is that it is trying to maximize this margin; the whole idea is that we want the road to be as wide as possible. So it’s also known as a maximum margin classifier; that’s just another name for it. But the whole idea is to come up with a line, a plane, or a hyperplane that separates these two groups in the best possible way.
Now, why am I saying line, plane, and hyperplane in the same breath? Because the idea is that if the data is n-dimensional, we come up with an (n-1)-dimensional decision boundary. For 2D data, it would be a line; for 3D data, it’s going to be a plane; and for data with more than three dimensions, it’s going to be a hyperplane that separates these two classes. Okay. Now, this is the main idea behind SVM. Now we’ll move on to the objective function, and we’ll try to look under the hood: how does it actually come up with this kind of hyperplane? How does it build that?
Okay, so in this particular example, we are trying to separate these blue negative samples from the red positive samples. Again, we’ll try to come up with a line which best separates these two classes, and we will try to build a road which is as wide as possible between them. Now, the idea is, let’s say there is a vector u, which is anywhere in the space, and you want to know whether this vector u belongs to the positive samples or to the negative samples. What we do is come up with another vector called w, which is perpendicular to the median of the road, and we project this u vector onto w. The dot product of u and w tells us how far along it is on the right-hand side or on the left-hand side.
So just to make sense of what I just said: we are trying to find out whether this u vector lies on the right side of the line or on the left-hand side of the line, and the way we do it is by taking a dot product between u and w and checking whether it is at least some constant c. Then we just bring the c value over to the other side and represent it as c = -b. This is just a mathematical trick, but everything is as simple as trying to see where u lies, whether it’s a positive sample or a negative sample. So for any new sample coming in, we can find out whether it belongs to the positive class or the negative class by taking a dot product. The only problem we have here is that we don’t know what w is, and that is what we will try to find out.
Okay. Now, as I told you, we are trying to build the widest road between these two classes. What we are essentially going to do is put in some constraints so that these points cannot be on the road. Positive samples need to be at least plus one away from the median line, and similarly, negative samples should be at least minus one away from the median line. Since this is the constraint, if you take a dot product between the w vector that we introduced in the previous slide and a positive sample, the result should be greater than or equal to one; it is not enough for it to be greater than or equal to zero. Similarly, for negative samples, it should be less than or equal to minus one. What this entire slide tells us is that a positive sample should not just be on the right side; it has to be some more distance away on the right side. So think of this black line as zero.
And think of this shaded line on the positive side as one. A positive sample cannot be within this distance; it has to be on the right-hand side of that shaded line. That’s why I’m putting in that constraint, and the same goes for negative samples. So these are the two constraints that we add to SVMs in order to come up with our objective function, because we want the road to be as broad as possible. Okay. Now, this is just showcasing that if you are on the right-hand side, the value will be greater than or equal to one; if you are on the left-hand side, it will be less than or equal to minus one; if you are on the median line, the value will be zero. For support vectors, the value is going to be exactly plus one or minus one, depending on the side. That is what’s represented here.
Now moving on to the part where we combine these two constraints. The idea is that we have got two constraints, which we don’t like, so we want to simplify things. The way to simplify is to combine these two constraints in such a way that they become just one constraint for all samples. We do this by introducing another variable called y_i, which is plus one for positive samples and minus one for negative samples. We multiply both sides of the positive constraint and the negative constraint by y_i, and we end up with this single constraint. So basically, what we have done in this particular slide is combine the two inequality constraints into one. Another thing we can say is that for all support vectors, this expression is going to be exactly equal to zero. The reason we can say that is that if y is equal to one, the value for a support vector is plus one, and if you subtract one from it, it becomes zero. So this is how we combine our constraints and come up with just one inequality.
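The decision rule and the combined constraint described above can be sketched in a few lines of plain Python. This is only an illustration: the values of `w` and `b` and the sample points are made up, and the helper names `classify` and `satisfies_margin` are hypothetical, not something shown in the session.

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

def classify(w, b, u):
    """Decision rule: +1 if w . u + b >= 0, otherwise -1."""
    return 1 if dot(w, u) + b >= 0 else -1

def satisfies_margin(w, b, x, y):
    """Combined constraint: y * (w . x + b) - 1 >= 0."""
    return y * (dot(w, x) + b) - 1 >= 0

# Hypothetical values, chosen only for illustration.
w = [1.0, 1.0]   # vector perpendicular to the median line
b = -3.0         # offset (b = -c)

print(classify(w, b, [3.0, 2.0]))              # w.u + b = 2.0
print(classify(w, b, [0.5, 1.0]))              # w.u + b = -1.5
print(satisfies_margin(w, b, [3.0, 2.0], 1))   # 1 * 2.0 - 1 = 1.0
print(satisfies_margin(w, b, [2.0, 1.5], 1))   # 1 * 0.5 - 1 = -0.5, on the road
print(satisfies_margin(w, b, [0.5, 1.0], -1))  # -1 * -1.5 - 1 = 0.5
```

The point `[2.0, 1.5]` is on the correct side of the median line but still inside the road, which is exactly the situation the stricter "at least one" constraint rules out.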
Okay, I won’t go into depth on this slide; let me just go to the next one and give you a broader idea of what we are trying to do here. As I mentioned earlier, we are trying to maximize this distance, the margin between the two classes. The way we represent this margin, it comes out to be two divided by the magnitude of w, where w, as I explained earlier, is the vector perpendicular to the median line. This width, this margin, is what we are trying to maximize with respect to w. Now, the problem is that when we try to maximize this, we will take a derivative of it, and since w is in the denominator, it’s going to get extremely messy and difficult to work with. So what we can do is move w into the numerator, which turns this maximization problem into a minimization problem. As I told you, we’ll move w into the numerator, and then we’ll square it and multiply by a half. The reason we square it and halve it is mathematical convenience, just to make our maths easier. So we were trying to maximize the width, and now you see how we have transformed our maximization problem into a minimization one just by moving w from the denominator to the numerator. This is one part. The other part is the constraint that we came across in the previous slides. So now we have our objective function and we have our constraint, which turns out to be a constrained optimization problem. That is what this slide represents. Now, the way we solve this constrained optimization problem is with the help of Lagrange multipliers.
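Written out, the steps above look roughly as follows. This is a reconstruction from the spoken description, not a copy of the slide; the symbols x_+ and x_- (a positive and a negative support vector) and the offset b are the usual textbook notation and are assumed here.

```latex
% The two constraints and their combined form
\[
\vec{w}\cdot\vec{x}_{+} + b \ge 1, \qquad
\vec{w}\cdot\vec{x}_{-} + b \le -1
\quad\Longrightarrow\quad
y_i\,(\vec{w}\cdot\vec{x}_i + b) - 1 \ge 0
\]

% Width of the road, measured between a positive and a negative support vector
\[
\text{width}
= (\vec{x}_{+} - \vec{x}_{-})\cdot\frac{\vec{w}}{\lVert\vec{w}\rVert}
= \frac{(1 - b) - (-1 - b)}{\lVert\vec{w}\rVert}
= \frac{2}{\lVert\vec{w}\rVert}
\]

% Maximizing the width is equivalent to this constrained minimization
\[
\min_{\vec{w},\,b}\ \tfrac{1}{2}\lVert\vec{w}\rVert^{2}
\quad\text{subject to}\quad
y_i\,(\vec{w}\cdot\vec{x}_i + b) - 1 \ge 0 \ \ \text{for all } i
\]
```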
The Lagrange multiplier is a concept from optimization theory which basically says that if you have an objective function and a set of constraints, you can come up with a new expression which combines the two, and then you don’t have to worry about your constraints separately. This is the whole idea. So we’ll come up with this equation. This Lagrangian is nothing but the following: the first term is the objective function, and the second term is the constraint multiplied by alpha, the Lagrange multiplier that we have added. That’s the only thing there is to it. Okay. Now, why is there a negative sign? The reason is that if the constraint is violated, the value in the second term is going to be negative, and with the negative sign in front it adds to the objective function. Since we are trying to minimize the objective function, it has to pay a penalty because it broke a constraint. That is the reason we have a negative sign before the second term. Okay, so this is how we have come up with the primal function of the support vector machine. This is also known as the Lagrangian, or you can call it the primal function for support vector machines. If anyone has any questions, again, feel free to drop them into the chat box as we go.
Moving on to the next one. Now we’ll try to solve the primal. The problem with the primal is that there are a lot of unknown variables in it: w is unknown to us, alpha is unknown to us, and b is also unknown to us. So what we’ll do is take the derivative of this primal function with respect to w and b and see what happens.
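For reference, the Lagrangian described above has the following standard form. It is reconstructed from the spoken description, with one multiplier alpha_i per training sample and the offset written as b; the slide itself may use slightly different notation.

```latex
\[
L(\vec{w}, b, \alpha)
= \tfrac{1}{2}\lVert\vec{w}\rVert^{2}
- \sum_{i} \alpha_i \left[\, y_i\,(\vec{w}\cdot\vec{x}_i + b) - 1 \,\right],
\qquad \alpha_i \ge 0
\]
```

When a sample violates its constraint, the bracket is negative, so the subtracted term increases L, which is the penalty mentioned in the talk.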
Now, when you take the derivative with respect to w and set it to zero, you get a value for w that turns out to be this. This is an interesting result. The reason it is interesting is that w, which was just a normal vector perpendicular to the median, is actually a linear combination of the support vectors.
This is a very interesting result. Now, since we know w, we can use it in the decision rule to find out whether a new sample coming in belongs to the positive class or the negative class. So we have solved for w, which is great. Now moving on to the other variable, b. If we differentiate with respect to b, we get this result. Now we will take this value of w and substitute it back into the primal expression. That is how we turn our primal function into the dual function, by plugging these values back into the primal objective function, and we get the dual function like this. The good thing about the dual function is that it is in just one variable, alpha: there is no w and no b, and the samples appear only through dot products. It is a very beautiful result that came out, which we didn’t realize could happen.
Okay, so moving on to the next one. So now we have this objective function that we have to minimize, and these are the constraints that we have to respect. So now we have completed the SVM objective function. As I told you, we already found the value of w, so we can just replace w in our decision rule. If we do that, you will be able to find out whether a new sample belongs to the positive class or the negative class just by taking dot products with the support vectors.
Okay, now that we are done, there are a few points you should consider. First, the support vector machine is a constrained minimization problem: it has a constraint and an objective function.
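Collecting the derivation above in one place, in the standard textbook form (a reconstruction from the spoken description, writing the offset as b; the slide may present the dual as the equivalent minimization of the negated expression):

```latex
% Stationarity with respect to w and b
\[
\frac{\partial L}{\partial \vec{w}}
= \vec{w} - \sum_i \alpha_i y_i \vec{x}_i = 0
\ \Longrightarrow\
\vec{w} = \sum_i \alpha_i y_i \vec{x}_i,
\qquad
\frac{\partial L}{\partial b}
= -\sum_i \alpha_i y_i = 0
\ \Longrightarrow\
\sum_i \alpha_i y_i = 0
\]

% Dual function after substituting w back into the Lagrangian
\[
L_D(\alpha)
= \sum_i \alpha_i
- \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, y_i y_j\, (\vec{x}_i \cdot \vec{x}_j),
\qquad \alpha_i \ge 0,\quad \sum_i \alpha_i y_i = 0
\]

% Decision rule for a new sample u, using only dot products
\[
\sum_i \alpha_i y_i\, (\vec{x}_i \cdot \vec{u}) + b \ge 0
\ \Longrightarrow\ \text{positive class}
\]
```

L_D is maximized over alpha, which is the same as minimizing its negative. Only the support vectors end up with nonzero alpha_i, so only they contribute to the sums.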
Another good thing about the objective function is that it’s a convex function. What I mean by convex is that it has a global minimum, whereas some machine learning algorithms, like neural nets, have non-convex objective functions that suffer from local minima: you might get stuck in a local minimum and not be able to optimize any further. So basically there are a few cases where SVM works better than neural networks. The other point is that it only needs dot products of the samples, which helps us measure the similarity between different samples.
Okay, moving on to the slack variables. In a separable case, where the two classes are linearly separable like in this picture, we can build a road as wide as possible. Now, think of a non-separable case. In this particular case, look at this blue point which is on the other side, the wrong side of the black line. Here we have two goals: we want to build the road as wide as possible, but we also don’t want to misclassify. One option is a very narrow road that doesn’t misclassify any training data points, but the problem is that with such a narrow road there will be a lot of misclassification on the testing set. The other approach is to come up with a very wide road that allows a few misclassifications in our training set. That is what this slide represents: we can introduce slack variables, which make sure that we can have a wide road where a few misclassifications are okay. Now, because of the slack variables, a new term gets added to the primal objective function, as you can see, which makes it look really, really scary.
But the good thing is that the dual objective function remains the same for both the linearly separable case and the non-separable case. The only difference is in the constraint: earlier, alpha_i just had to be greater than or equal to zero; now the alpha value has to be within the range of zero to C. What C represents is a penalty term. If the value of C is very high, the penalty is very high, so you will not break any constraints: you won’t misclassify, and you will end up with a very small margin. If the value of C is not very high, you will come up with a large margin, at the cost of a few misclassifications. This is the same C parameter that is used in the scikit-learn library.
Okay. Now, the kernel trick. The idea behind the kernel trick is this: if we have a nonlinear data set and we try to use a linear classifier, the accuracy will not be good. So what are our options? One option is to build a neural net, which has nonlinear components in it. The other option is to transform our input space into a higher dimension such that, in the higher dimension, the data becomes linearly separable. Let me show you guys with an example. Think of this data set in two dimensions, which is not linearly separable. What I mean is that it cannot be separated by a single line. What we are trying to do here is separate these blue points from the red points. How can we do it? One way is to transform this data set into a higher dimension, into a 3D space, such that it looks like this. Now we can come up with a plane that actually separates these two classes.
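The idea of lifting 2D data into 3D so that a plane can separate it can be illustrated with a tiny example. The points below are invented, and the mapping (x1, x2) to (x1, x2, x1^2 + x2^2) is just one common choice for ring-shaped data; it is an assumption, not necessarily the transformation shown on the slide.

```python
# Invented toy data: one class sits near the origin, the other surrounds it,
# so no single straight line in 2-D can separate them.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4), (0.2, 0.2)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5), (1.4, 1.6)]

def lift(point):
    """Map (x1, x2) to the 3-D point (x1, x2, x1^2 + x2^2)."""
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)

# In the lifted space, the flat plane z = 1 separates the two classes.
inner_below = all(lift(p)[2] < 1.0 for p in inner)
outer_above = all(lift(p)[2] > 1.0 for p in outer)
print(inner_below, outer_above)
```

A linear classifier working on the lifted points only needs to threshold the third coordinate, even though the boundary it produces in the original 2D space is a circle.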
This is how we do it in other cases as well; for example, in logistic regression we try to transform the input space into a higher dimension so that the data becomes linearly separable. Now, a linear classifier is formulated like this, where the w’s are the weights and x is the input. With this formulation, we can only build linear models. But if we transform our input space into a higher dimension, such that x becomes phi of x, then we can build many more models, including nonlinear ones. So the idea is to transform your data into a higher dimension by replacing the input x with phi of x. That is how it works for a linear classifier; how this translates to SVM is what we’re going to tackle next. As I’ve told you, this is a linear classifier. Similarly, for SVM, we take dot products between the input and the support vectors. Now, let’s say we want to introduce non-linearity.
If we want to transform the input space into a higher dimension, we do a similar thing: we go from x to phi of x. Now, just think about it for a minute. We have some data in some dimension, we transform it into a higher dimension, and then we take a dot product. If you think about it, you will see that this would be extremely computationally expensive. The reason is that, first of all, we are transforming the input space into the high dimension, and then we are taking a dot product on top of that. So here comes the kernel trick. What the kernel does for us is this: rather than us transforming our data into a higher dimension and then taking a dot product, the kernel trick does it for us in the background. As you can see, instead of computing phi of x_i dot phi of x_j, we can come up with a kernel function which gives exactly the same result. Converting the input space into a high dimension and then taking the dot product is handled by the kernel function in the background, so we don’t have to worry about it.
Okay. I’ll just show you a small example: the linear kernel and the polynomial kernel in a separable space. If you look closely, you’ll see that with the linear kernel the boundary is a straight line, but the road is very narrow, and the examples that are on the boundary are known as the support vectors. Now, in this specific case, let’s say we use a polynomial kernel, which is a nonlinear kernel. For the same classification problem, first of all, the boundary is not going to be a straight line. As you can see, it’s a curvy line, and the support vectors have also changed. You can see this is an extra support vector.
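To make the kernel trick concrete, here is a small check using the degree-2 polynomial kernel K(x, z) = (x . z)^2 on 2D inputs. The explicit feature map `phi` below is the standard one for this kernel; the two sample points are made up for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without leaving the 2-D input space."""
    return dot(x, z) ** 2

# Made-up sample points.
x, z = (1.0, 2.0), (3.0, 0.5)

explicit = dot(phi(x), phi(z))   # transform first, then take the dot product
via_kernel = poly2_kernel(x, z)  # kernel trick: one 2-D dot product, then square
print(explicit, via_kernel)
```

Both routes give the same number (up to floating-point rounding), but the kernel version never builds the higher-dimensional vectors. The saving grows quickly with the input dimension and the polynomial degree.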
Another thing to understand here is that in places where the road can go wider, it becomes wider, whereas in places where it cannot, it just remains the same as in the previous case. So you see how the polynomial kernel is better in some sense than the linear kernel when the data is not linearly separable. There are a lot of kernel functions available: the polynomial kernel, the Gaussian or RBF kernel, and many others. The way it works is basically like this: we start with a linear kernel. If the linear kernel doesn’t give us really good accuracy, we move on to a polynomial kernel of degree two, degree three, degree four, and so on. If we don’t get good accuracy with a polynomial kernel, we move on to the RBF kernel. RBF stands for radial basis function, which comes from Fourier series. The whole idea in any machine learning algorithm is to match the complexity of the model to the complexity of the data. By going from linear to polynomial to RBF kernels, we are increasing the complexity of the model in order to match the complexity of our data.
Okay. Now, SVM hyperparameters. The C parameter I’ve already covered: it’s a penalty parameter. For a large value of C, there is going to be a small margin. The reason is that if the penalty is high, you will try not to make mistakes, so you are going to build a very narrow road. Similarly, if the penalty is low, you are going to build a very wide road. The reason we do that is that there will be a few misclassifications, but the overall fit will be a better one. There is no rule of thumb for choosing a large or a small value of C; it all depends on the data. The gamma parameter is specific to the Gaussian radial basis function kernel. We’ll cover this one in detail in the code.
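The effect of C on the width of the road can be checked with a small scikit-learn experiment. This is a sketch on invented toy data, not the code from the session; it uses `SVC` with a linear kernel and reads the fitted normal vector from `coef_`, so the margin width is 2 divided by the norm of that vector.

```python
import numpy as np
from sklearn.svm import SVC

# Invented 2-D toy data: two clusters, with one class-0 point
# sitting close to the class-1 cluster.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [4.2, 4.0],   # class 0
              [4.0, 5.0], [5.0, 4.0], [5.5, 5.5], [5.0, 6.0]])  # class 1
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def margin_width(C):
    """Fit a linear SVM with penalty C and return the margin width 2 / ||w||."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    return 2.0 / np.linalg.norm(clf.coef_[0])

small_C = margin_width(0.01)     # low penalty: wide road, some points inside it
large_C = margin_width(1000.0)   # high penalty: narrow road, no violations
print(small_C, large_C)
```

With the low penalty the road is much wider and several training points fall inside it; with the high penalty the road shrinks until every training point is outside it.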
Okay. Does anyone have any questions before we move on to the coding part?
Okay, I don’t see any questions, so it seems everyone is happy. Let’s move on to the coding part. What I have done here is take a data set from Kaggle: the Breast Cancer Wisconsin data. It’s a simple data set that doesn’t have any noise and has around 500 rows. The problem statement for this data is that we are trying to predict whether a person has cancer or not. The features are derived from an image. There are 10 different categories of information, and for each of those 10 categories, they have the mean, the standard error, and the worst value. So basically, there is this diagnosis column, which is what we are trying to predict, based on 30 features. All these features are mentioned here; you can look at them.
Okay. Now, we are reading the data like this. I have a bunch of libraries here: NumPy, pandas, Matplotlib, and Seaborn. NumPy is for numerical work with arrays. pandas is more for data munging and processing. Matplotlib is for plotting graphs, and Seaborn is again a plotting library. scikit-learn is the library for models and for preprocessing. Now, into the df data frame, we read the CSV file, data.csv. We’ll start by looking at how big the data is and how many columns it has, so we’ll just check the shape of the data. After that, to understand the data, we’ll have a quick look at it by calling the head function on the data frame. If you look at it, you’ll see that there is an id column and there is diagnosis, which is our target variable. M represents a malignant tumour, whereas B represents a benign one. Just for your information, the malignant ones are the bad ones. Then there are some other columns, which I have no idea about.
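The loading steps described above look roughly like this. To keep the snippet runnable on its own, it reads a tiny made-up CSV from a string instead of the real Kaggle file; with the actual download you would call `pd.read_csv("data.csv")`. The stand-in header ends with a trailing comma, which is one way an all-empty "Unnamed" column can appear; whether that is what happens in the real file is an assumption.

```python
import io

import pandas as pd

# Made-up stand-in for data.csv (values are invented, not real patient data).
# The trailing comma in the header creates an extra, nameless column.
csv_text = """id,diagnosis,radius_mean,texture_mean,
101,M,20.1,18.3,
102,B,12.8,15.9,
103,B,13.5,14.2,
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (rows, columns)
print(df.head())   # first few rows, for a quick look at the data
```

pandas names the header-less column after its position, so in this stand-in it shows up as `Unnamed: 4`, filled entirely with NaN.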
And the good thing is that you don’t need to have any idea about these columns. It’s the job of your model to find the patterns for you. You can also see there is another column called Unnamed: 32, which looks a bit weird; we’ll see what it is later. So, moving on. We will look at the names of the columns, and you can see that there is actually a pattern: radius, texture, perimeter, and so on, all with the mean value. Similarly, we have radius_se, texture_se, and so on. As I told you, for 10 different categories they have come up with the mean, standard error, and worst values. That’s why we have these three groups of columns.
Now we’re going to start with the data exploration part and prepare the data from there. First of all, let’s start with the missing values: how many missing values do we have in this data? What we’re doing here is calling the isnull function, summing it up, and dividing by the number of rows to come up with the percentage of missing data in each column. You can see here that the Unnamed: 32 column has 100% missing data; somehow it just got added. It doesn’t have any information in it and doesn’t have any predictive power, so it’s safe to remove this particular column. Just to give you guys a little more understanding of missing data: if there are columns with more than 60% missing data, then generally it’s a good idea to drop them entirely, unless you have a strong reason to keep them. If you do want to keep them, there are ways to keep those high-missing columns, for example by building another column as a dummy variable that flags the missing values.
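The missing-value check can be sketched as follows, using a small made-up data frame in place of the real data. The 60% cut-off is the rule of thumb mentioned above, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data frame.
df = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "radius_mean": [14.2, 20.1, np.nan, 12.8],
    "Unnamed: 32": [np.nan] * 4,
})

# Percentage of missing values per column: count of nulls / number of rows.
missing_pct = df.isnull().sum() / df.shape[0] * 100
print(missing_pct)

# Columns above the 60% rule of thumb are candidates for dropping.
high_missing = missing_pct[missing_pct > 60].index.tolist()
print(high_missing)
```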
Okay, moving on. Looking at the shape again, it’s the same; nothing has changed. Now look at the id column. As you saw earlier, when we look at the data there are a few things to check. If you look at the id column, it has all unique values: the number of unique values is equal to the number of rows. What this means is that it doesn’t have any predictive power, because for every row this value is different. So we have these observations: the id column has all unique values, so there’s no pattern, and the Unnamed: 32 column has all NaN values. These two columns are of no use to us, so we can safely go ahead and drop them. The way we drop them is by calling the drop function on our data frame and passing the list of columns we want to drop. axis equals 1 represents columns, whereas axis equals 0 represents rows. Every time I work with it I forget what exactly axis means, so what I generally do is press Shift+Tab to bring up the documentation. You can see the signature of the function, and if you scroll down you will see what each of the parameters means. Here you can see that 1 means columns and 0 means rows. So with df.drop you can drop a row or a column depending on the axis parameter. inplace equals True means that you want to drop the columns from this data frame itself. If you don’t include inplace equals True, it will give you a new data frame excluding these columns, but it won’t drop them permanently from this data frame. I don’t want these columns to be used in future, so I’m dropping them permanently by passing inplace equals True.
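The drop step above can be sketched as follows, on a tiny illustrative frame (the values are made up; only the column names follow the talk). It shows the difference between getting a new data frame back and dropping in place.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; only the column names follow the talk.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "M"],
    "radius_mean": [17.99, 13.54, 20.57],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# An id column with one distinct value per row carries no pattern to learn.
id_is_unique = df["id"].nunique() == df.shape[0]

# Without inplace=True, drop returns a new frame and leaves df untouched.
trimmed = df.drop(["id", "Unnamed: 32"], axis=1)
shape_before = df.shape

# With inplace=True, the columns are removed from df itself.
df.drop(["id", "Unnamed: 32"], axis=1, inplace=True)
shape_after = df.shape

print(shape_before, shape_after)
```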
If you look at the shape, it has changed: earlier it was 33 columns, now it is 31. This is just to verify that the columns have been dropped. Moving on to the target feature. When you start working on a problem, you have to look at the distribution of the target feature. The idea is to understand whether it’s a balanced or an imbalanced dataset. In an imbalanced dataset, one of the classes has a high number of records whereas the other has a low number. In this specific case, about 62% belongs to the benign class and about 37% belongs to the malignant class, so it is more or less a balanced dataset. If the division had been something like 90% in one class and 10% in the other, that would have been an imbalanced dataset. And when there is an imbalanced dataset, it’s a whole different world: you have to use different metrics, and you need some kind of oversampling or undersampling technique to handle it. You can see that we have 357 benign cases and 212 malignant cases. Starting with variable identification: what we are trying to do here is understand the data type of each variable we are working with. This is really important, because categorical data cannot go directly into a model. Think of this diagnosis column: it cannot go into our model as an object, because the model doesn’t expect string values; it has to be numerical. So this is an important step where we actually change our dataset. The way we handle it is by label encoding our target variable, diagnosis, mapping M (malignant) to 1 and B (benign) to 0. You can see here that the diagnosis has changed from M to 1.
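A short sketch of the class-balance check and the label encoding described above. The frame is rebuilt from the class counts quoted in the talk (357 benign, 212 malignant) rather than read from the real file.

```python
import pandas as pd

# Rebuild only the target column from the counts quoted in the talk.
df = pd.DataFrame({"diagnosis": ["B"] * 357 + ["M"] * 212})

counts = df["diagnosis"].value_counts()                # records per class
shares = df["diagnosis"].value_counts(normalize=True)  # fraction per class
print(counts)
print((shares * 100).round(1))

# Label encoding: malignant -> 1, benign -> 0.
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})
print(df["diagnosis"].dtype)
```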
Okay. Now, df.describe gives us different statistics about the numerical columns: the count (the number of records), the mean value of each column, the standard deviation, the minimum, the maximum and the percentiles. From this you can get an idea of what kind of features we have. If you look at radius_mean, the maximum value is 28, whereas for area_mean the maximum value is 3500. For smoothness_mean the values are much smaller. You can see that there is a big difference in the values. What this means is that we cannot compare these columns directly, and we have to apply some kind of feature scaling so that they become comparable. There are two types of feature scaling available: one is called normalisation and the other is standardisation. In normalisation we bring a column onto a 0-to-1 scale, so that the minimum value of the column becomes 0, the maximum value becomes 1 and all other values lie within that range. Standardisation is more like finding the z-scores. The way you do it is you subtract the mean from each value and then divide by the standard deviation. The output tells you how many standard deviations above or below the mean a value is. Just to make this concrete, I would say that if you are trying to compare the sales of a TV with the sales of apples, you cannot compare the raw sales of these two things.
You can only compare them by bringing them onto the same scale and then asking whether each one sold more or less than its own average. Okay. Now coming on to the univariate analysis. In univariate analysis we’re going to be looking at histograms and box plots. The reason we do this is to get an understanding of the missing values and the outliers. We will try to eyeball these, but from our previous code we know that there are no missing values, so we are safe on that and will just be looking out for outlier values. This loop I’ve written basically goes through all the columns, assigns some colours and titles, and plots the distribution of data for each of the features. You can see for radius_mean how the distribution goes from 5 to 30. Similarly for texture_mean. It also gives us the mean value and the standard deviation, just to get an understanding of the data. You can see these all look more or less like a normal distribution, so this is a very well-behaved dataset. There are cases where a feature is not normally distributed, where it is right-skewed or left-skewed, and in those cases we have to transform the feature. The way we transform is by taking a log, or a square root or cube root. The idea is, let’s say we have a right-skewed feature, where the tail on the right is quite long and the values are concentrated on the left-hand side. If you take a log transformation of that, it will become closer to normally distributed. The only constraint with the log transformation is that it cannot take negative values or zero. In those cases we can use a cube root, or a square root if the values include zero but no negatives.
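To illustrate the transformation idea above, here is a small sketch on synthetic right-skewed data (drawn from a log-normal distribution, which is an assumption made purely for the demonstration). The `skewness` helper is a hypothetical name for the usual moment-based measure.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed feature: long tail on the right, mass on the left.
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=5000)


def skewness(x):
    """Moment-based skewness: roughly 0 for a symmetric distribution."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    return (centred ** 3).mean() / x.std() ** 3


logged = np.log(skewed)  # only valid because every value is strictly positive
print(skewness(skewed), skewness(logged))

# The cube root is defined for zero and negative values, unlike the log.
print(np.cbrt(np.array([-8.0, 0.0, 27.0])))
```

The log pulls the long right tail in, so the skewness of the transformed values sits close to zero while the raw values are strongly skewed.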
So those are the other two transformations that can also be used. Now, moving on. This is showing the box plot. What a box plot is used for is to get an understanding of how the data is distributed: where the median is and what the outliers are. The box is the interquartile range, the whiskers extend out from it, and anything outside the whiskers is treated as an outlier. You can see radius_mean has some outliers; similarly texture_mean also has some outliers, and so do some of the others. Outliers can impact your prediction, so we have to decide whether we really need to fix those outliers or not. Sometimes, when the outliers are pretty huge, log again helps. Think of a really high value like 10 to the power 10. That generally happens in sales: for example, sales from Melbourne to Sydney are pretty high, whereas sales from Melbourne to Perth are not that high, and that will skew our dataset. What log will do for us is bring them closer: log base 10 of 10 to the power 10 gives us a value of 10. That is the power of taking a log: it squashes all the values closer together. Okay, so the information we’re getting from this is that there are a few features which have outliers, and we’ll see whether we have to deal with them or not. Similarly, I have gone over the standard error variables, the next 10 features that we are working on. The way we select these features is by using iloc and passing the range of columns that we want to select. This part represents all the rows, whereas this part represents the column index, that is, which columns I need from the data frame. Again, for these features the distribution of data looks quite well behaved.
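A minimal sketch of the box-plot outlier rule discussed above, written out by hand. The 1.5 times IQR fence is the common box-plot convention and is assumed here; the values are made up, with one deliberately extreme entry.

```python
import numpy as np

# Made-up feature values with one deliberately extreme entry.
values = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1                 # height of the box
lower = q1 - 1.5 * iqr        # conventional lower whisker limit (assumed)
upper = q3 + 1.5 * iqr        # conventional upper whisker limit (assumed)

outliers = values[(values < lower) | (values > upper)]
print(outliers)

# A log squashes very large values: log10 of 10**10 is 10.
print(np.log10(1e10))
```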
So there isn’t much transformation that needs to be done on this dataset. Similarly, there is a box plot that we have done for the standard error features. Again, you can see there are quite a few outliers, and it will be a call we have to take whether to work on these outliers or not. Finally, we have done a distribution of the worst (largest) values. We can see that most of them are quite well behaved, so nothing major has to be done, in the sense that you don’t have to transform many of the features to make them normal. Similarly, if you look at the box plot, again we have a few outliers for the worst features. Okay. Now, the input that we get from all this processing is that all features are roughly normally distributed and there are no significant outliers. I’m saying there are no significant outliers because we can see that there are some outliers, but we don’t know whether they will impact our accuracy or not. So we have to check our accuracy first and then see whether they are impacting our prediction. Okay. Now, what we have done in this graph is look at the distribution of each feature with respect to the target variable. What would be a good feature? A good feature would be one where there is a clear separation between the two classes. If you look at radius_mean, for low values of radius_mean we can easily predict a benign tumour; similarly, for high values of radius_mean we can easily predict a malignant tumour. So this is a good feature. Whereas if you look at texture_mean and at the two distributions for the two classes, you will see that they overlap quite a lot.
So this turns out to be not such a good feature, because there is no clear separation: we can’t say for what value of texture_mean it’s going to be a malignant tumour or a benign tumour. Similarly, for perimeter_mean you can see that for low values it’s going to be benign and for high values it’s going to be malignant. So just by eyeballing these plots you can see which are the good features to work with. Similarly, area_mean also seems to be a good feature. smoothness_mean is a not so good feature. compactness_mean could be a good feature. concavity_mean might not be such a good feature. concave points_mean could be a good feature: for lower values it has quite a high number of benign cases, whereas for high values of concave points_mean you have malignant cases. Look at symmetry_mean: this is again a case of a not so good feature, because the classes have overlapping distributions, which basically says that if you select any value of symmetry_mean it could be either malignant or benign. The same goes for fractal_dimension_mean; it’s a not so good feature. So, by looking at all those plots, what we concluded is that these columns, radius, perimeter, area, compactness and concave points, can be used for classification, because larger values of these parameters tend to correlate with malignant tumours. It goes the other way around as well: lower values of these parameters tend to correlate with benign tumours. We can ignore the other features because they don’t have that much predictive power. Okay. Moving on to the standard error features.
The same idea has been applied to these features as well. You see, for this one it’s not so good, because for lower values it’s both malignant and benign. Similarly, texture_se is a pretty bad feature: there is no clear separation, since the distributions of both classes overlap. The whole idea is to have a feature where there’s a clear separation between the two distributions; that would be the best feature we can have. So the exercise is to come up with features which are relevant for our prediction, which have some predictive power. You can see again these are not so good features either. Maybe concave points_se would be a reasonable feature, since lower values are mostly benign. So let’s keep concave points_se; it seems to be an okay feature. So the observation is that concave points_se can be used, because for lower values of concave points_se it can predict the benign tumour. Similarly, we went on and did the same thing for the worst features. Again, you can see that radius_worst is a pretty good feature, because for lower values of radius_worst it shows a benign tumour, whereas for higher values it shows a malignant tumour. texture_worst is a not so good feature. perimeter_worst is again a really good feature to work with. And that is how it goes. Now, if you look at concave points_worst, this is a really good feature, and the reason is that you can see a fairly clear separation between the two classes: for lower values it’s benign and for higher values it’s malignant. Okay. So these are the features we have identified that can be used from this set, where there is a correlation with malignant or benign tumours. The other features are not needed.
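The per-class distribution plots used in this feature screening can be sketched roughly as below. The data is synthetic (two normal distributions with made-up centres standing in for benign and malignant radius_mean values), so the picture only illustrates what a well-separated feature looks like; it is not the talk's actual plot.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for one feature, split by class (made-up parameters).
benign = rng.normal(12.0, 1.8, size=357)
malignant = rng.normal(17.5, 3.2, size=212)

fig, ax = plt.subplots(figsize=(6, 4))
bins = np.linspace(5, 30, 26)  # shared bin edges so the two classes line up
ax.hist(benign, bins=bins, alpha=0.5, label="benign (0)")
ax.hist(malignant, bins=bins, alpha=0.5, label="malignant (1)")
ax.set_xlabel("radius_mean")
ax.set_ylabel("count")
ax.legend()
```

A feature is a good candidate when the two histograms sit mostly apart; heavy overlap, as described for texture_mean, means a given value tells you little about the class.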
So, as an observation summary, we can sum it up by saying that we will be considering five features from the mean features, one feature from the SE features and five features from the worst features. In total, out of the 30 features, we have brought it down to just 11 features which we think have good predictive power. The way we select these columns is I’ve just taken the names of these columns and put them into a new data frame, df1. You can see that we have 12 columns in this new data frame df1, with the one additional column being the target column. This is again just looking at the data, trying to get an understanding of it. Another thing that we are going to do here is find the correlation between these features. The idea of correlation is that if there are two variables, x and y, and they are perfectly positively correlated, then an increase in the value of x goes with a proportional increase in y. Similarly, for perfect negative correlation, an increase in the value of x goes with a decrease in y. If two features are not correlated, the value will come out to be zero: a change in x is not accompanied by any consistent linear change in y. When I talk about correlation here, it’s Pearson correlation, which tries to find a linear correlation between the variables. Okay, so this plot represents the correlation between the different features. You can see it’s a matrix of these columns. On the left-hand side we have all these columns, on the bottom we have all these columns, and it checks the correlation of each with every other. So you can see that radius_mean has a perfect correlation with itself, which makes sense.
But it has an almost perfect correlation with perimeter_mean as well, which is not so good for us. The reason we don’t like correlated features is that if the effect of one feature is already there in the model, we don’t want another feature with the same effect in the model. It’s an unnecessary column that we don’t want to include. So the idea is to look at all these correlation values and then finally drop the redundant columns. But before dropping them, we are going to plot these features against one another. You can see this scatter plot: whenever there is an increase in radius_mean there is an increase in radius_worst. It’s not perfectly correlated, but there is a very high correlation between these two features. Similarly, we have done the scatter plot between perimeter_worst and radius_mean, and just by looking at the plot you can make out that there is a high correlation between these two features. So, for all the features we have identified as highly correlated, we still come up with a scatter plot to see whether the correlation is really there or whether we are seeing it because of a few outliers. This is just a visual inspection to make sure that the correlation exists and that it’s fine to drop these columns. Now, looking at radius_mean and perimeter_mean, you can see the almost perfect correlation: for an increase in the value of radius_mean there is an increase in the value of perimeter_mean. Same for this one as well; it looks pretty much linear. Going ahead, what we are doing is, instead of dropping columns, we only keep the features which actually make sense, the ones that are not highly correlated with each other.
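The correlation screening described above can be sketched like this. The frame is synthetic: perimeter_mean is generated from radius_mean on purpose so that the pair is nearly collinear, and the 0.9 threshold is an assumed cut-off, not a value from the talk.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
radius = rng.normal(14.0, 3.5, size=200)

# Synthetic frame: perimeter is built from radius, smoothness is independent.
df1 = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": 2 * np.pi * radius + rng.normal(0.0, 0.5, size=200),
    "smoothness_mean": rng.normal(0.1, 0.01, size=200),
})

corr = df1.corr()  # Pearson correlation by default
print(corr.round(2))

threshold = 0.9  # assumed cut-off for "highly correlated"
pairs = [
    (a, b, float(corr.loc[a, b]))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(pairs)
```

Each flagged pair is a candidate for keeping only one of the two columns, ideally after a scatter plot confirms the relationship is not driven by a few outliers.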
So, from 11 features, we have come down to just three features that we’ll be using to predict the diagnosis. Now if you look at the correlation matrix again, you will find that the highest correlation is around 0.73, between radius_mean and diagnosis. That is actually good, because diagnosis is our target variable. We don’t want correlation between features, but correlation between diagnosis and radius_mean is good. Okay. If you look at the other correlations, the maximum is around 0.60, which is fine. Now, moving on to the modelling part. What we have done here is assign the diagnosis column of the data frame to y, and X is given these three features that we have identified for the model. As I mentioned earlier, these features are on different scales and in different units, and we have to bring them together so that they become comparable. As I mentioned, there are two ways, normalisation and standardisation. I’m doing standardisation in this specific case, but you can go with normalisation as well; it all depends on the dataset. The way we are doing standardisation is by using StandardScaler from the scikit-learn library. What fit_transform does for us is find the mean of each feature, subtract it, and then divide by the standard deviation of each feature. Okay, so that is what this fit_transform method does for us. Now let me just show you what X looks like.
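As a small check on the two scaling methods described above, the sketch below computes standardisation and normalisation by hand and compares them with scikit-learn's `StandardScaler` and `MinMaxScaler`. The four rows are illustrative values, not the talk's three selected features.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative values for three features on very different scales.
X = np.array([
    [17.99, 1001.0, 0.118],
    [13.54, 566.3, 0.097],
    [20.57, 1326.0, 0.085],
    [11.42, 386.1, 0.143],
])

# Standardisation by hand: subtract the column mean, divide by the column std.
z_by_hand = (X - X.mean(axis=0)) / X.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X)

# Normalisation by hand: rescale each column to the 0-1 range.
minmax_by_hand = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
minmax_sklearn = MinMaxScaler().fit_transform(X)

print(np.allclose(z_by_hand, z_sklearn))
print(np.allclose(minmax_by_hand, minmax_sklearn))
print(type(z_sklearn))  # a NumPy array, not a data frame
```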
Okay, X is a NumPy array now; it’s not a data frame anymore. You can see, for the first row, how these values have changed. The values are not on the original scale anymore; they have been transformed and are now expressed in standard deviations from the mean. Okay. In this step, we are splitting the data into training and testing sets. The way we do it is by calling the train_test_split function, which again we are getting from scikit-learn. test_size is the share of records that we want in our testing set, so 30% of the data will be allocated for testing and 70% of the data will be used for training. Then random_state is there to make sure that the next time we run this, the data is split in the same way. If you don’t set random_state, every time you run this you will get different results, because it will split in a random way. But if you set random_state equal to zero, you are always going to get the same split. This is really important in order to reproduce your results. Now, here I’m calling logistic regression. LogisticRegression is another class that we are getting from scikit-learn. We fit it on our training dataset, and once we have fit it we predict on the test set and compare those predictions with the actual values to come up with the accuracy. If you look down here, you’ll see that the accuracy of logistic regression is 88%. Then there are a few other metrics that I’ve shown here: precision, recall, F1 score and support. These metrics, precision, recall and F1 score, become more relevant when we are working with imbalanced datasets. Since we are working with a balanced dataset, we can safely go with accuracy as our metric to compare models.
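A rough sketch of the split-and-fit step above. It uses the copy of the Wisconsin data bundled with scikit-learn rather than the Kaggle CSV, so the column names differ ("mean radius" instead of radius_mean) and the target coding is flipped to match the talk. The three features are an assumption, since the talk does not name its final three, so the accuracy will not necessarily match the 88% quoted.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, target = load_breast_cancer(return_X_y=True, as_frame=True)

# Assumed feature choice; the talk's exact three features are not stated.
features = ["mean radius", "mean smoothness", "mean symmetry"]
X = StandardScaler().fit_transform(X_all[features])

# scikit-learn codes malignant as 0; flip it so that 1 = malignant as in the talk.
y = (target == 0).astype(int)

# 30% of the rows go to the test set; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
print(classification_report(y_test, y_pred))  # precision, recall, F1, support
print(confusion_matrix(y_test, y_pred))
```

This follows the talk's order of scaling first and splitting second; fitting the scaler on the training rows only is a common variation that keeps test-set statistics out of the preprocessing.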
Now, this is the confusion matrix. Again, this is more relevant when we are working with imbalanced datasets, but it just shows how our predictions have been divided into the different buckets. Okay, so with logistic regression we are getting an accuracy of 88%. Now we’ll work with the support vector machine and see how the accuracy is. Initially we’ll start with a linear SVM, so we have specified here what kind of kernel we are looking for. By default, C takes a value of 1, the kernel is RBF and there is a degree parameter. These are the different hyperparameters that you can choose from when working with this library. In this one I started with a linear kernel, but by default it’s RBF. With a linear kernel we are getting an accuracy of 88%, which is the same as what we were getting with logistic regression. Again, this part I won’t be covering because it’s more relevant for imbalanced datasets. Now, what we are doing here is trying a polynomial kernel of degree three. The reason I have said degree three is that the default value of the degree hyperparameter is three, so I’m just trying a kernel of polynomial type. If you look at this, you will see that the accuracy has actually reduced. Okay, let me check. So, the reason there has been a reduction in accuracy is that when you map the input space into the polynomial feature space, you might not get linearly separable data. That’s why the accuracy has reduced to 87%. Now we are trying the RBF kernel. RBF, the radial basis function, is the most complex kernel. When we try that, we get an accuracy of 88% again, which is similar to what we got for logistic regression and for our linear SVM.
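The three kernels tried above can be compared in one loop, as in this sketch. As before, it relies on scikit-learn's bundled copy of the data and an assumed set of three features, so the accuracies it prints are not expected to reproduce the figures quoted in the talk.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, target = load_breast_cancer(return_X_y=True, as_frame=True)
features = ["mean radius", "mean smoothness", "mean symmetry"]  # assumed choice
X = StandardScaler().fit_transform(X_all[features])
y = (target == 0).astype(int)  # flip so that 1 = malignant, as in the talk

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for kernel in ["linear", "poly", "rbf"]:
    # C=1.0 and degree=3 are the library defaults; degree only affects "poly".
    clf = SVC(kernel=kernel, C=1.0, degree=3)
    clf.fit(X_train, y_train)
    scores[kernel] = accuracy_score(y_test, clf.predict(X_test))

print(scores)
```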
Now, you can see these are the different accuracies that we have got so far. What we are trying to do here is tune the hyperparameters. C is one parameter and gamma is another. C is the parameter that controls what kind of margin we want: how wide a margin we want, whether we want a soft margin, a small margin or a large margin. For high values of C you will get a small margin, and for low values of C you will get a larger margin, because C is a penalty term: if you put more penalty on misclassified points, you get a smaller margin. Okay. Now, gamma is specific to the RBF kernel. I’ll go over the idea of what gamma does. In the Gaussian kernel, that is, the RBF kernel, gamma is inversely related to the variance. So it controls how far the influence of a single support vector reaches.
Okay. So, based on this, we do a grid search with GridSearchCV. We run through all of these parameter values and fit a support vector machine for each combination. The scoring mechanism we have used for comparison is accuracy, and based on that we come up with the best parameters. You can see the best parameters come out to be these. Now, putting these parameters into the model, you will see that the accuracy has improved to 89%. So this is how you can improve the accuracy from 88% to 89%. If you want to improve the accuracy further, there are other machine learning algorithms you can try, like neural networks, random forests and so on, but we have done up to here. Finally, you can dump the model into a pickle file, and then it can be used in other production software for future predictions. Okay, I think we are done with it. I’ll just show you one more thing, what we really do. I think this last one is a clear representation of what we really do in the end. If you have any questions, please go ahead and shoot.
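The hyperparameter search and the pickling step from the coding session can be sketched as follows. The grid values for C and gamma are assumptions (the talk does not list its grid), the data and features are the same stand-ins used in the earlier sketches, and the model is pickled to bytes here instead of a file so the example leaves nothing on disk.

```python
import pickle

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, target = load_breast_cancer(return_X_y=True, as_frame=True)
features = ["mean radius", "mean smoothness", "mean symmetry"]  # assumed choice
X = StandardScaler().fit_transform(X_all[features])
y = (target == 0).astype(int)  # flip so that 1 = malignant, as in the talk

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Assumed grid: larger C means a narrower margin, larger gamma a shorter reach.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
    "kernel": ["rbf"],
}
search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
best_model = search.best_estimator_
print(best_model.score(X_test, y_test))

# Serialise the tuned model; pickle.dump(best_model, file) writes it to disk instead.
payload = pickle.dumps(best_model)
restored = pickle.loads(payload)
```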
So we’ve started with the Q&A session, guys. You can go ahead and post your questions in the chat group. Rishabh will take as many as he can in the last 10 minutes, and I’ve shared the feedback form; please feel free to fill it in at the end.
So, Rishabh, there was this one question that somebody posted to me directly.
He said, please tell me about regularisation.
Regularisation is an entirely different concept. The idea behind it is this: whenever we are trying to fit a model to our data, we are trying to match the complexity of the model with the complexity of the data. If the data is linearly separable and we are trying a linear model, then you’ll get a good accuracy. What you do is split your data into training and testing sets. Say you evaluate your model on the training set and the accuracy comes out high, but when you evaluate it on the testing set the accuracy is low. That is a case of high variance: your model is overfitting the data, it’s trying to memorise your training dataset. The other way round, your training accuracy itself could be pretty low, and that is a case of high bias. Now, let’s say our model is overfitting, it’s trying to memorise the training data. There are two types of regularisation, L1 and L2: L1 is lasso and L2 is ridge. What the regularisation term does is shrink the coefficients of your model so that the model is not able to memorise your training dataset and generalises to the testing set. That is the whole idea behind regularisation. Okay, there’s another question: how to decide between standardisation and normalisation? As I mentioned earlier, standardisation is more like calculating z-scores, and normalisation is converting your values to lie between zero and one. It all depends. In the case of standardisation, your values could be more than one as well, so an outlier could end up as a really high value. But what would happen in the case of normalisation is... Is everyone able to hear me properly? Okay.
As I mentioned earlier, standardisation is calculating a z-score, where values could be high, in the sense that they could be more than one as well, whereas normalisation makes sure that the values are within the range of zero to one. I would say try both and see which one gives you better accuracy. There is no silver bullet that says standardisation would work better than normalisation, but generally in the industry people tend to prefer standardisation over normalisation. Okay, another question is what needs to be done in the case of an imbalance problem. In an imbalanced dataset, the number of records of one of the classes is low. There are a few ways we can handle that. One is to increase the number of records for the minority class, by oversampling it: we just replicate the same records of the minority class so that the classes become balanced. That is one way, but the problem with it is that you can end up overfitting, because it is essentially the same data that you are replicating. Then there are ways to downsample the majority class. That might only be possible if you have a huge amount of data, where you can afford to downsample the majority class, so that is also not always a good way of doing it. There’s another technique called SMOTE, where you synthetically generate records for the minority class. It’s an oversampling technique that generates new minority-class records by making use of k-nearest neighbours, so that it is less prone to overfitting. Okay, so that is for imbalanced datasets. Another thing for imbalanced datasets is that you can use log loss as a metric, because it works really well in that case.
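The two simple resampling options mentioned in this answer, replicating minority records and downsampling the majority, can be sketched with plain pandas as below. The 90/10 frame is made up for illustration. SMOTE itself is not shown, since it lives in a separate library and generates synthetic records rather than copies.

```python
import pandas as pd

# Made-up imbalanced frame: 90 records of class 0, 10 records of class 1.
df = pd.DataFrame({"feature": range(100), "label": [0] * 90 + [1] * 10})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Random oversampling: draw minority rows with replacement until classes match.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
oversampled = pd.concat([majority, minority_upsampled]).reset_index(drop=True)

# Random undersampling: keep only as many majority rows as there are minority rows.
majority_downsampled = majority.sample(n=len(minority), random_state=0)
undersampled = pd.concat([majority_downsampled, minority]).reset_index(drop=True)

print(oversampled["label"].value_counts())
print(undersampled["label"].value_counts())
```

Note that the oversampled minority class still contains only the original ten distinct records, which is exactly why plain replication can lead to overfitting.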
Okay. There’s another question which asks how log, square root and cube root transformations are different from scaling. Think of it like this: let’s say we have a feature which has a lot of outliers. If you normalise it, the values will be within the range of zero to one, but the outlier, the highest value, will be one and the remaining values will be very low, like zero point something. If you take a log transformation before normalising, it will squash that entire range of values and bring it into a much smaller range, and when you then normalise on top of that, the difference between those values will not be so huge. That is the main difference between transformation and scaling, and why we should do the transformation before scaling. Okay, there’s a question on good references for reading. I would say that there are lots of good websites and some YouTube videos. If you’re really interested, there is a really good YouTube video from an MIT professor who goes into detail on the derivation of the objective function for SVM, which is a very good starting point if you want to go deeper. Okay. I’ve again got a question on how to decide between standardisation and normalisation. Again, there is no silver bullet for what to choose between standardisation and normalisation; it totally depends on the data that you’re working with. You can try out both approaches. The only thing is that when you standardise you might get values which are not bounded between zero and one, whereas with normalisation your feature range will always be between zero and one. Okay, I think we are done with all the questions. Any other questions?
I think pretty much all the questions are answered. Do you want to wait like two minutes or so for anybody else to pop anything?
Yeah, sure.
By the way, I love the last slide. It’s amazing.
“What I really do”? That’s the story of my life.
I think that’s pretty much it, and we’re almost on time, which is amazing. Thank you, Rishabh. Thank you so much for this wonderful session. I’m sure everyone has got a lot of insights into what they would want to try out and experiment with today. And I want to say thank you on behalf of the whole team at GreyAtom, and especially the students as well.
Okay, thanks for the opportunity. I really, really enjoyed working on the slides and preparing for that. That’s great. It’s great.
Thank you so much. We had fun too. Thank you. We’ll end it here on a good note. Thank you.
Yep. Thanks. Thanks, everyone. Take care.