Knowing how to analyse data and derive results from it is only part of a data scientist's job. Presenting those results in a concise and relevant way is just as essential to excelling in the role. This is what we call Data Visualisation. During this workshop, Kent Aquereburu, Data Scientist at Societe Generale, will present the best practices you need to know to succeed in this field.
The preliminary phases of a Machine Learning project
Before even starting a data science project, it is important to understand the company's business need, because this is the best way to know how to prepare the data and choose the variables that will have an impact on what we are trying to predict.
Once the need is understood, we will try to define the variable we are trying to predict. This could be attrition, income or a category for example.
Finally, we need to ask why we are doing Machine Learning. Are we trying to predict a phenomenon, explain its causes, or both? Depending on the answer, the use we make of a model will be different. Indeed, two variables can be correlated even though they have nothing to do with each other.
For example, a correlation has been found between the suicide rate in the US and the budgetary expenditure of Americans on science, technology and astronomy. However, this correlation is spurious because the suicide rate and budget spending have nothing to do with each other.
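To make the idea concrete, here is a minimal sketch using only the Python standard library. The two series are made up for illustration (they are not the suicide or budget figures mentioned above): each one simply grows over time with its own independent noise, yet their correlation coefficient comes out high. The `pearson` helper and the series names are assumptions introduced for this example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
years = range(20)

# Two made-up series generated independently of each other:
# each one grows over time, with its own random noise.
series_a = [10 + 0.5 * t + rng.gauss(0, 0.5) for t in years]
series_b = [200 + 8.0 * t + rng.gauss(0, 8.0) for t in years]

r = pearson(series_a, series_b)
print(f"correlation = {r:.2f}")
```

Neither series influences the other; they only share an upward trend over time. A high coefficient on its own therefore says nothing about cause.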
In a Machine Learning model, one could include these two variables and the model would be quite capable of making predictions. On the other hand, it would be much more difficult to explain those predictions. This is why we need to find variables that better explain what we want to predict.
The major steps of a Machine Learning project
Here is how one might classify the stages of a Machine Learning project:
- 1. Retrieval of raw data
- 2. Pre-processing of data
- 3. Feature Engineering
- 4. Dividing the database into test data and training data
- 5. Choice of algorithm
- 6. Training the algorithm
- 7. Prediction on test data
- 8. Prediction on real data
- 9. Reporting of results
The raw data can be retrieved from any data source: for example, APIs that provide web data, a CRM, or Excel files.
Pre-processing of data allows this database to be "cleaned". Indeed, there may be missing or inconsistent data. You will therefore do everything possible to improve the "quality" of your data because this is what will enable your model to increase its predictive performance.
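As an illustration of this cleaning step, here is a minimal sketch in plain Python. The records, the column names (`age`, `income`) and the cleaning rules (drop rows with a negative age, fill missing values with the column median) are all assumptions chosen for the example. In a real project the right rules depend on your data and on the business need.

```python
from statistics import median

# Hypothetical raw records: None marks a missing value,
# and a negative age is treated as an inconsistent entry.
raw_rows = [
    {"age": 34, "income": 42000},
    {"age": None, "income": 38000},
    {"age": -5, "income": 51000},
    {"age": 29, "income": None},
    {"age": 41, "income": 60000},
]


def clean(rows):
    """Drop inconsistent rows, then fill missing values with the column median."""
    # 1. Remove rows whose age is present but impossible.
    kept = [r for r in rows if r["age"] is None or r["age"] >= 0]

    # 2. Compute each column's median from the values that are present.
    medians = {
        col: median(r[col] for r in kept if r[col] is not None)
        for col in ("age", "income")
    }

    # 3. Replace missing values with that median.
    return [
        {col: (r[col] if r[col] is not None else medians[col]) for col in medians}
        for r in kept
    ]


cleaned = clean(raw_rows)
for row in cleaned:
    print(row)
```

Filling with the median is only one possible choice. Dropping incomplete rows or using another imputation rule may suit your data better.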
Feature Engineering is the selection and, where needed, the construction of the variables that will actually have an influence on what you want to predict.
The aim is to refine the pre-processed data and keep only what is really useful for the model. Data visualisation comes into play in this part of the project because we need to explore the different variables to determine what will and will not have an influence.
In the fourth step, we split the data into a training set and a test set. The reason for doing this is to check whether our model actually performs well on data it has not seen during training.
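The split itself can be sketched in a few lines of standard-library Python. The helper name `split_train_test`, the 80/20 ratio and the seed are assumptions made for this illustration. In practice, libraries such as scikit-learn offer an equivalent (`train_test_split` in `sklearn.model_selection`).

```python
import random


def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    # Shuffle first so that the test set is not just the first rows of the file.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


dataset = list(range(100))  # stand-in for 100 rows of real data
train, test = split_train_test(dataset, test_ratio=0.2)
print(len(train), len(test))
```

The model is trained only on `train`. The rows in `test` are kept aside and used afterwards to measure its performance.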
We will then choose the algorithm to use. This choice depends on your problem and the capabilities of your computer. If you want to know more about Machine Learning models, please have a look at our introduction to Machine Learning.
Once your model is ready and trained, you will use it to make predictions on your real data and report the results. This is also where data visualisation comes in.
Best Practices in Data Visualisation
Let's first define what Data Visualisation is.
DataViz is about transforming raw data into visual information to communicate a message.
The purpose of data visualisation is to highlight a part of your results rather than to make an exhaustive presentation of the data. It is not uncommon to have worked for several days on a Machine Learning project but to have only a few minutes to present your results. This is why you need to be able to highlight the most important points and be concise. To do this, you can use various tools, such as Tableau or Power BI.
In terms of good practice, the first rule is not to make a graphic when a single sentence will do.
When constructing a graph, always bear in mind that it can easily mislead. Even if you think your message is clear, it can sometimes be misinterpreted. This can happen if you truncate the axes of your graphs. For example, if you start your axis at 90% instead of 0%, the differences look large when they are in fact exaggerated, because values that all lie between 90% and 100% do not necessarily differ much.
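The effect of a truncated axis can be quantified with simple arithmetic. The sketch below compares how tall two bars are drawn relative to each other, depending on where the vertical axis starts. The scores (92% and 96%) and the helper `drawn_height_ratio` are hypothetical, chosen only for illustration.

```python
def drawn_height_ratio(value_a, value_b, axis_start=0.0):
    """Ratio of the drawn heights of two bars when the axis starts at axis_start."""
    return (value_b - axis_start) / (value_a - axis_start)


score_a, score_b = 92.0, 96.0  # hypothetical results, in percent

honest = drawn_height_ratio(score_a, score_b, axis_start=0.0)
truncated = drawn_height_ratio(score_a, score_b, axis_start=90.0)

print(f"axis from 0%:  bar B is {honest:.2f}x as tall as bar A")
print(f"axis from 90%: bar B is {truncated:.2f}x as tall as bar A")
```

With the axis starting at 0%, the two bars look almost identical (a ratio of about 1.04). With the axis starting at 90%, bar B is drawn three times as tall as bar A, although the underlying gap is still only four percentage points.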
Avoid graphs with two vertical axes, as they can suggest spurious correlations. It is better to make two graphs, each with its own vertical axis.
Colours are important. Choose colours that are clearly different from each other and do not use more than 6 different colours per graphic. Colours also carry meaning. For example, red expresses danger and green expresses serenity. It is important to be aware of these associations so that you do not use colours that your audience could misinterpret.
Axis scales are not always necessary if your aim is simply to show a trend rather than a specific number. Finally, sort your data. Sorted data really is easier to understand than unsorted data.