My first Data Science project and what I've learned from it.

Alvaro Matsuda
10 min read · Jun 3, 2021


Photo by Isaac Smith on Unsplash

Hi everyone!

This was my first data science project. I learned a lot doing it, and I would like to share some of that here, as it may help others and it is a way to record my own progress.

This post is not meant to be a technical deep dive. I'll show the steps I took to develop the project and explain the why behind each one, with a few small code sketches along the way.

The Project

The business context: Rossmann is a drugstore chain that operates in many European countries. The CFO asked store managers to predict their sales up to six weeks in advance, because the stores are going to be renovated to match a new visual identity and, with the predictions, he can better plan how to allocate the budget for each store.

Proposed solution: build a regression model that predicts sales up to six weeks in advance, helping the CFO plan the budget allocation for each store. To make the predictions easy to access, I deployed the model behind a Telegram Bot that returns the prediction for a given store ID.

You can see the whole development of the project here:
https://github.com/AlvaroMatsuda/Rossmann_Sales_Prediction

The 10 steps taken to develop this project

In this section I'll go through each step, explaining its importance and the reasoning behind it.

Step 1. Data Description

This is the first step after collecting the data. Its purpose is to get a quick look at the data. Here we compute basic statistics for all variables, such as count, mean, median, mode, minimum, maximum, quantiles, range, standard deviation, skewness and kurtosis, and we check data types and missing values.
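As an illustration, a minimal pandas sketch of this step could look like the following (the file name is just an assumption for this example):

```python
import pandas as pd

# Load the raw data (the file name here is an assumption for illustration).
df = pd.read_csv("train.csv", low_memory=False)

# Dimensions, data types and missing values.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Basic descriptive statistics for the numerical variables,
# extended with range, skewness and kurtosis.
stats = df.describe().T
stats["range"] = stats["max"] - stats["min"]
stats["skew"] = df.select_dtypes("number").skew()
stats["kurtosis"] = df.select_dtypes("number").kurtosis()
print(stats)
```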

With these basic statistics we already get useful information about our data: whether there are missing values or outliers, the distribution and magnitude of each variable, and the dimensions of our dataset.

Now we have a better idea of the problem we are facing going forward. Sometimes the data is too big or there are many missing values, and we have to plan the development of the project accordingly.

Step 2. Feature Engineering

The objective of this step is to derive new variables from the data already available, to better describe the phenomenon we are modeling.

To guide the feature engineering, I created hypotheses about what may affect the behavior of what we are modeling, in this case what may affect the sales of Rossmann's stores. With all the hypotheses written down, I could define which features/variables I was able to create or derive from the original dataset to answer them.
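A typical example of this kind of derivation is extracting time-based features from a date column. The sketch below assumes a 'Date' column, as in the Rossmann dataset:

```python
import pandas as pd

# Toy frame with a 'Date' column, used to derive time-based features
# that can help answer hypotheses such as "sales grow at the end of the year".
df = pd.DataFrame({"Date": pd.date_range("2015-06-01", periods=5, freq="D")})

df["year"] = df["Date"].dt.year
df["month"] = df["Date"].dt.month
df["day"] = df["Date"].dt.day
df["week_of_year"] = df["Date"].dt.isocalendar().week.astype(int)
df["is_weekend"] = (df["Date"].dt.dayofweek >= 5).astype(int)
print(df)
```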

Step 3. Variable Filtering

This step comes before the Exploratory Data Analysis so that we can filter out data that could interfere with our analysis. Another reason for this step, if you work at a company, is to remove data and variables that are subject to business limitations, or that make no sense to analyse given the actual objective of the project.

For example, in my project I removed the records of closed stores and of days with no sales, which represent days when the store was closed and therefore sold nothing. I also removed the 'customers' column, since I would not know the number of customers up to six weeks in advance. (Predicting it could actually become another project.)
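In pandas, this filtering could look like the sketch below, using the column names from the Rossmann dataset ('Open', 'Sales', 'Customers'):

```python
import pandas as pd

# Toy example with Rossmann-style columns.
df = pd.DataFrame({
    "Store": [1, 1, 2],
    "Open": [1, 0, 1],
    "Sales": [5263, 0, 6064],
    "Customers": [555, 0, 625],
})

# Keep only rows where the store was open and had sales,
# and drop 'Customers', which is unknown six weeks in advance.
df = df[(df["Open"] != 0) & (df["Sales"] > 0)].drop(columns=["Customers"])
print(df)
```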

Step 4. Exploratory Data Analysis (EDA)

This step is one of the most important, as it gives us business understanding even if we are not very familiar with the company's business model. I'll highlight three main goals of EDA:

1- Better understanding of the business;

2- Discover which variables best explain what we are trying to predict, and thus which ones are important to our model;

3- Validate hypotheses and generate insights.

There are three ways to explore our data: univariate, bivariate and multivariate analysis. We usually rely on plots to analyse the data, as they are more direct.

In the univariate analysis we look at each variable individually, one by one. To do so we usually use the basic statistics covered in the data description step and plots such as histograms and scatterplots (for numerical variables) and boxplots and countplots (for categorical variables).

In the bivariate analysis we check how an independent variable (a variable that describes what we are trying to predict) affects the behavior of the dependent variable (what we are trying to predict). In other words, we check whether there is a correlation between the independent and dependent variables. With that we can already tell which variables are important for our model, as the most correlated ones are those that best describe the phenomenon. For the bivariate analysis we use all sorts of plots: scatterplots, barplots, lineplots, heatmaps, regplots and others.

In the multivariate analysis we plot a heatmap of the Pearson correlation for the numerical variables and, in this project, the Cramér's V association for the categorical variables; both give us a view of how all variables relate to each other.
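The Pearson heatmap is essentially a one-liner with pandas and seaborn (df.corr() plus sns.heatmap). Cramér's V is less common, so here is a minimal, uncorrected version built on scipy's chi-squared test (the toy column names are only illustrative):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Simple (uncorrected) Cramér's V between two categorical variables, from 0 to 1."""
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.to_numpy().sum()
    r, k = confusion.shape
    return float(np.sqrt((chi2 / n) / min(r - 1, k - 1)))

# Toy categorical columns.
df = pd.DataFrame({
    "store_type": ["a", "a", "b", "c", "b", "a", "c", "b"],
    "assortment": ["basic", "extra", "basic", "extended", "basic", "extra", "extended", "basic"],
})
print(cramers_v(df["store_type"], df["assortment"]))
```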

Step 5. Data Preparation

After exploring our data, it's time to prepare it to be used by the models. Why prepare the data? For one, how can we compare two variables with different scales, like the sales of a given store and the distance to the nearest competitor? And how can we transform categorical variables into numerical ones that the algorithms can interpret, while preserving the information they carry? For that, we basically have three main methods: normalization, rescaling and transformation.

With normalization we rescale the variable so that its mean becomes 0 and its standard deviation (std) becomes 1; in practice, the std becomes the unit of measurement. For example, imagine a variable with a mean of 100 and a std of 20. After normalization the mean becomes 0 and the std becomes 1, so an observation that had the value 40 becomes -3, meaning that 40 sits 3 standard deviations below the mean of 100.

With rescaling we shrink the range of the data to [0, 1] or [-1, 1], depending on the method. Taking the value 40 from the example above: if the variable ranged from 0 to 200, the Min-Max scaler would turn it into 0.2. There are other rescaling methods, such as the Robust Scaler, which uses the interquartile range (IQR) instead of the minimum and maximum, and mean normalization; each of them has its pros and cons.
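To make this concrete, here is a small scikit-learn sketch of the scalers mentioned (the numbers are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# One numerical column with arbitrary values, including 40 from the example above.
values = np.array([[40.0], [80.0], [100.0], [120.0], [200.0]])

print(StandardScaler().fit_transform(values).ravel())  # mean 0, std 1
print(MinMaxScaler().fit_transform(values).ravel())    # rescaled to [0, 1]
print(RobustScaler().fit_transform(values).ravel())    # centered on the median, scaled by the IQR
```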

Finally we have transformation, where we turn categorical variables into numerical ones. Here we have to be careful, because the encoding must preserve the information the variable carries. For example, imagine a Temperature variable with the values [cold, warm, hot, very hot]. This is an ordinal categorical variable: it conveys an idea of order. Now imagine a City variable with the values [São Paulo, Rio de Janeiro, Campinas, Maceió]. This variable conveys no order; the values are just names.

So for each case there is a more suitable transformation method. I'm not going to discuss each one; I'll just list some of them: One Hot Encoding, Label Encoding, Ordinal Encoding, Target Encoding, Frequency Encoding and Embedding Encoding.
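For the two variables in the example above, a sketch of ordinal and one-hot encoding could look like this:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "temperature": ["cold", "warm", "hot", "very hot"],
    "city": ["São Paulo", "Rio de Janeiro", "Campinas", "Maceió"],
})

# Ordinal encoding: 'temperature' has an order, so we pass it explicitly.
ordinal = OrdinalEncoder(categories=[["cold", "warm", "hot", "very hot"]])
df["temperature_enc"] = ordinal.fit_transform(df[["temperature"]])

# One-hot encoding: 'city' has no order, so each value becomes its own 0/1 column.
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1)
print(df)
```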

Step 6. Feature Selection

Now it's time to select the most important variables to be used by the models. Here we follow the Occam's Razor principle, which says that the simplest explanation of an event or phenomenon should prevail over more complex ones. Applied to data science, we want the smallest set of variables that best explains what we are modeling.

Again we have three main families of methods to select the best variables: filter, embedded and wrapper methods. In this project I used a wrapper method, so I'll discuss only that one.

A wrapper method uses a machine learning algorithm to evaluate combinations of variables, measuring the model's performance for each one. First it takes a subset of variables and measures the model's performance; then it adds another variable, retrains the model and measures the performance again. If performance improves with the added variable, the variable is considered important; if it gets worse, we can conclude that it is not.
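As an illustration of the idea (not necessarily the exact algorithm used in the project), scikit-learn's SequentialFeatureSelector performs this kind of forward selection:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic data standing in for the real dataset.
X, y = make_regression(n_samples=300, n_features=10, n_informative=4, random_state=42)

# Forward selection: features are added one by one, keeping the ones that improve the score.
selector = SequentialFeatureSelector(
    RandomForestRegressor(n_estimators=50, random_state=42),
    n_features_to_select=4,
    direction="forward",
    cv=3,
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask marking the selected features
```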

In this step we also split our data into training and test datasets. The training dataset is used to fit the machine learning models, and with it we evaluate how well the model learned. With the test dataset we evaluate how well the model generalizes, by giving it data it has never seen before; it indicates how well the model will do on new data.
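Since the goal here is to predict sales up to six weeks ahead, a natural way to split is by date, holding out the most recent six weeks as the test set. A sketch, assuming a 'Date' column:

```python
import pandas as pd

# Toy frame with a 'Date' column, as in the Rossmann data.
df = pd.DataFrame({"Date": pd.date_range("2015-01-01", "2015-07-31", freq="D")})

# Hold out the last six weeks of data as the test set.
cutoff = df["Date"].max() - pd.Timedelta(weeks=6)
train, test = df[df["Date"] <= cutoff], df[df["Date"] > cutoff]
print(len(train), len(test))
```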

Step 7. Machine Learning Modeling

Here is where the magic happens: we train our models and evaluate their performance. There are three types of machine learning algorithms: supervised, unsupervised and reinforcement learning.

In supervised learning we know the value of what we are predicting for the training data, and the model learns from those examples to predict new data. Classification, regression and time series models belong to this type.

Unsupervised models assign classes or make predictions without us providing the classes or targets in advance. Clustering models belong to this group.

In reinforcement learning the model learns on the fly by maximizing a reward in a particular situation. It learns from past experience in a cumulative way.

In this project we have a supervised problem: we want to predict sales. I used the following models: Random Forest Regressor, Linear Regression, Lasso Regression and XGBoost Regressor, and I also computed a simple average model to serve as a benchmark. Comparing the performance of all models, the Random Forest Regressor performed best and XGBoost worst. Even so, I chose XGBoost to see how much hyperparameter fine-tuning could improve it, which I'll talk about next.
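A comparison like this boils down to fitting each model on the training set and scoring it on the test set. A sketch on synthetic data (XGBRegressor from the xgboost package could be added to the dictionary in the same way):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared Rossmann dataset.
X, y = make_regression(n_samples=1000, n_features=8, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Benchmark: the "average model" always predicts the mean of the training target.
baseline_mae = mean_absolute_error(y_test, np.full_like(y_test, y_train.mean()))
print(f"Average model MAE: {baseline_mae:.2f}")

for name, model in models.items():
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name} MAE: {mae:.2f}")
```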

Step 8. Hyperparameter Fine-Tuning

After training the models, measuring the performance of each one and choosing one to implement, we can tweak its hyperparameters to maximize its performance.

In my project I used random search: you pass a list of candidate values for each hyperparameter, and at each run a value is drawn at random from each list before training the model. We also need to define how many times we are going to run it. After running the model that number of times, we keep the hyperparameters that produced the best result.

There is also grid search, where we pass a list of values for each hyperparameter and the model is trained with every possible combination of them; then we keep the combination with the best result. It's important to note that this method can be very expensive time-wise, since it trains the model for every possible combination.
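A bare-bones random search over an XGBoost regressor could look like the sketch below (the candidate values are illustrative, it requires the xgboost package, and the search is scored on a held-out validation split):

```python
import random
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor  # requires the xgboost package

# Candidate values for each hyperparameter (illustrative lists).
param_space = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 9],
    "learning_rate": [0.01, 0.03, 0.1],
    "subsample": [0.5, 0.7, 1.0],
}

X, y = make_regression(n_samples=1000, n_features=8, noise=10, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

random.seed(42)
best_mae, best_params = float("inf"), None

# Run a fixed number of iterations, sampling one value per hyperparameter each time.
for _ in range(10):
    params = {name: random.choice(values) for name, values in param_space.items()}
    model = XGBRegressor(**params, random_state=42)
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_valid, model.predict(X_valid))
    if mae < best_mae:
        best_mae, best_params = mae, params

print(best_params, best_mae)
```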

Step 9. Translating and Interpreting the Performance

Here is where we get the real value of data science, as we translate the performance of the model into business numbers ($$).

But first we need to interpret the model's performance, since we have to explain it to people who are not familiar with data science language. In this project I evaluated the regression model with three metrics: MAE (Mean Absolute Error), MAPE (Mean Absolute Percentage Error) and RMSE (Root Mean Squared Error).

These metrics tell us how wrong our predictions are, or, put another way, how far the predictions fall from the real values. MAE and MAPE are closely related: MAE is the average of the absolute errors, while MAPE expresses that error as a percentage of the real values. RMSE is another way to measure the error, but larger errors weigh more in its calculation, so if many predictions are very far from the real values the RMSE will be larger than the MAE. RMSE is more commonly used to evaluate the model itself, while MAE and MAPE are better for communicating with non-technical people, as they are easier to understand.

With those errors I could build best- and worst-case sales scenarios. The MAE can be read somewhat like a standard deviation: adding and subtracting it from the predicted value gives the best and worst scenarios. With those scenarios, and the error of each store, the CFO can decide which scenario to take into account when planning the budget allocation.
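The sketch below computes the three metrics and the per-store scenarios on made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up predictions for two stores, just to illustrate the calculations.
results = pd.DataFrame({
    "store": [1, 1, 2, 2],
    "sales": [5200.0, 4800.0, 7100.0, 6900.0],
    "prediction": [5000.0, 5100.0, 6800.0, 7200.0],
})

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

def mape(y, yhat):
    return np.mean(np.abs((y - yhat) / y))

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

y, yhat = results["sales"], results["prediction"]
print(f"MAE: {mae(y, yhat):.2f}  MAPE: {mape(y, yhat):.2%}  RMSE: {rmse(y, yhat):.2f}")

# Best and worst scenarios per store: predicted total plus/minus the store's MAE.
scenarios = results.groupby("store").apply(
    lambda g: pd.Series({
        "predicted_sales": g["prediction"].sum(),
        "mae": mae(g["sales"], g["prediction"]),
    })
)
scenarios["worst_scenario"] = scenarios["predicted_sales"] - scenarios["mae"]
scenarios["best_scenario"] = scenarios["predicted_sales"] + scenarios["mae"]
print(scenarios)
```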

Step 10. Deploying the Model

After all the steps we've taken, it's finally time to deploy the model so it can be accessed by anyone who needs the prediction. In this project I deployed it behind a Telegram Bot, where the CFO can get the prediction on his phone. All we need to do is send the ID of the store we want, and the Bot returns the prediction.
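A heavily simplified sketch of the bot's reply step is shown below. The prediction API URL and its JSON format are assumptions for illustration; the sendMessage call is the standard Telegram Bot API endpoint:

```python
import requests

TELEGRAM_TOKEN = "<your-bot-token>"
PREDICTION_API = "https://<your-prediction-api>/rossmann/predict"  # hypothetical endpoint

def reply_with_prediction(chat_id: str, store_id: int) -> None:
    # Ask the deployed model API for the prediction of the requested store.
    response = requests.post(PREDICTION_API, json={"store_id": store_id})
    prediction = response.json().get("prediction", "not found")

    # Send the answer back to the user through the Telegram Bot API.
    requests.post(
        f"https://api.telegram.org/bot{TELEGRAM_TOKEN}/sendMessage",
        json={"chat_id": chat_id, "text": f"Store {store_id}: predicted sales of {prediction}"},
    )
```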

Here is a demonstration of the Telegram Bot:

Final Thoughts

It was a long road to finish this project. I ran into some problems during development, but in the end I got through them. Data science is a complex field and involves so many different kinds of knowledge that it sometimes gets overwhelming. If I can give one piece of advice, it is this: start your studies by solving a real case. It keeps us motivated, because we get to see the results of our effort.

All the knowledge I acquired doing this project was just the essentials needed to solve the problem. I do not come from a math background, and sometimes I didn't understand some of the calculations at first (some I still don't), but we have to study in a cyclical way: each time we pass through the same problem, we get a little bit further.

About me

I am a geographer currently working as a data scientist. For that reason, I am an advocate of spatial data science.

Follow me for more content. I will write a post every month about data science concepts and techniques, spatial data science and GIS (Geographic Information System).


