A Guide to Approaching a Machine Learning Problem
Today, algorithms are buzzwords. Everyone is learning different kinds of algorithms: logistic regression, random forests, decision trees, SVMs, gradient boosting algorithms, neural networks, etc. New algorithms appear every day. But data science is not just applying different algorithms to data. Before applying any algorithm, you must understand your data, because that will help you improve the performance of your algorithms later. For any problem, one needs to iterate over the same steps (data preparation, model planning, model building and model evaluation) to improve accuracy. If we jump directly to model building, we end up directionless after one iteration. The following are the steps I use for approaching any machine learning problem.

The first step I suggest is to understand your problem properly, with a good understanding of the business and its market. There is no scenario like: here is the data, here is the algorithm, and bam! Proper business understanding will help you handle the data in the upcoming steps. For example, if you have no idea how the banking system works, you will not know whether a feature like the customer's income should be included or not.

The next step is to collect relevant data for your problem. Besides the data you have internally in your company, you should also add external data sources. For example, for sales prediction you should understand the market scenario for sales of your product. GDP may affect your sales, or maybe population does. So collect this kind of external data. Also remember that any external data you use must still be available to you in the future, when your model is deployed. For instance, if you use population in your model, you must be able to collect that figure next year as well in order to get next year's predictions. I have seen many people who use only their internal data without realizing the importance of adding external data to their dataset.
But in reality, external features have a good impact on our use case.

Now that you have collected all the relevant data for your problem, you must divide it for training and testing. Many data scientists follow the 70/30 rule to divide the data into two parts: a training set and a test set. Many others follow the 60/20/20 rule to divide the data into three parts: a training set, a test set and a validation set. I prefer the second option because in this case you use the test set for improving your model and the validation set for the final verification of your model in an actual scenario.

... with it. I was working on a loan default prediction problem. My accuracy was 78%. I took my problem to the person who was handling the financial systems related to loans.
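As a rough illustration of the 60/20/20 rule described earlier, here is a minimal sketch using only the Python standard library. The function name `three_way_split`, its parameters and the 100 placeholder records are assumptions made for this example, not something from a particular library. The set names follow this article's usage (test set for improving the model, validation set for the final check).

```python
import random


def three_way_split(rows, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle rows and cut them into train / test / validation parts.

    Naming follows this article: the test set is used while improving the
    model, and the validation set is kept aside for the final check.
    Many references swap these two names.
    """
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)

    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains (about 20%)
    return train, test, validation


# Hypothetical data: 100 record ids standing in for real rows.
records = list(range(100))
train, test, validation = three_way_split(records)
print(len(train), len(test), len(validation))  # 60 20 20
```

Fixing the seed keeps the three parts the same from one run to the next, so a change in accuracy between iterations comes from the model and not from a different split.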