In the last five years, we have created more data than in all of human history before that. We are now producing so much data that it is becoming difficult to manage. This is what is known as Big Data. We had the pleasure of welcoming Victoria Galano, Data Scientist at Air France, to our workshop. She shed some light on what Big Data is and on its applications in the business world.
What is it?
"Big Data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze."
This definition comes from the McKinsey Global Institute. It tells us that Big Data is tied to the problems of managing, storing and analysing data so voluminous that most of the tools we know today cannot handle it. As a result, managing a Big Data infrastructure requires extensive expertise.
Today, we have new solutions, such as Hadoop, Cassandra and Spark, that are capable of managing huge volumes of data. Cloud Computing (including AWS, Google Cloud Platform and Microsoft Azure) also brings its own suite of very useful solutions for data storage and management.
We also often hear about Machine Learning, which should not be confused with Big Data. Machine Learning is a field that enables predictive analysis to be carried out automatically. If you want to know more about it, please have a look at our article Introduction to Machine Learning.
How big is data today?
"90% of the data in the world today has been created in the last two years alone, at 2.5 quintillion bytes of data a day!" (IBM Marketing, 2017). Frameworks such as Spark, which is written in Scala, have emerged to support such a large volume of data.
The lesson is that data creation worldwide follows an exponential curve. Yet recent studies have shown that we analyse only 0.5% of that data to make operational decisions.
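To make the exponential claim concrete, here is a minimal sketch of what the IBM figure would imply. It assumes that the "90% in the last two years" ratio holds steadily over time, which the quote itself does not state, and the function name `implied_growth` is purely illustrative.

```python
import math

def implied_growth(share_recent: float, window_years: float) -> tuple[float, float]:
    """Return (yearly growth factor, doubling time in years).

    Assumption: the total amount of data grows exponentially, and
    `share_recent` of it was always created in the last `window_years`.
    """
    # If 90% is recent, the remaining 10% was the whole total one window ago,
    # so the total was multiplied by 1 / (1 - 0.9) = 10 over that window.
    factor_over_window = 1.0 / (1.0 - share_recent)
    yearly_factor = factor_over_window ** (1.0 / window_years)
    doubling_time = window_years * math.log(2) / math.log(factor_over_window)
    return yearly_factor, doubling_time

yearly, doubling = implied_growth(share_recent=0.90, window_years=2.0)
print(f"x{yearly:.2f} per year, doubling every {doubling * 12:.1f} months")
```

Under that assumption, the total volume of data would roughly triple every year and double about every seven months.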
How to characterise Big Data?
There is a lot of talk about the 3 Vs that define Big Data:
- Volume: As you can see, the volume of data to be processed is enormous. The digital world represented 1.2 Zettabytes (1.2e+9 Terabytes) of data; by 2020, it is projected to represent 35 Zettabytes. We therefore need new ways to store and manage our databases so that we are no longer limited by storage space.
- Variety: Data no longer comes only from Excel spreadsheets. It comes from a wide variety of sources and can take many forms: connected objects, tweets, Facebook posts, images and videos. We therefore need tools that allow us to analyse these new types of data.
- Velocity: In traditional data analysis, we used to take a large batch of data, analyse it and then extrapolate results. Now data is sent to servers continuously and needs to be analysed continuously.
This is why we need solutions that allow us to produce continuous analysis.
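The difference between batch and continuous analysis can be sketched in a few lines. This is only an illustration, not how Spark or any other framework mentioned here does it: the function `running_mean` and the sample readings are hypothetical. It updates a statistic each time a value arrives instead of waiting for the full data set.

```python
from typing import Iterable, Iterator

def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, one result per incoming value."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: earlier values do not need to be kept in memory.
        mean += (value - mean) / count
        yield mean

# Hypothetical sensor readings arriving one at a time.
readings = [10.0, 20.0, 30.0, 40.0]
results = list(running_mean(readings))
print(results)  # [10.0, 15.0, 20.0, 25.0]
```

Because only the count and the current mean are stored, the same approach works on a stream that never ends, which is the situation Velocity describes.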
Over the years, more Vs have been added to the characteristics of Big Data. Here are the most popular ones:
- Veracity: Is the data accurate and trustworthy?
- Value: Can the data bring value to the business?
- Variability: Can the data change type over time?
- Visualisation: How can the data be presented in a relevant way?
Where does Big Data come from?
Big Data comes from many sources: CRM systems, web browsers, blogs, music applications and more. However, we can group these sources into a few main sectors:
- Science: Science, and especially astronomy, produces a lot of data. The world's largest telescope, SKA, for example, generates 400 Petabytes (4e+8 Gigabytes) of data per year.
- Web: The web is naturally an incredible source of data creation. Facebook and Twitter are said to produce around 15 Terabytes of data per day, while Google alone produces 20 TB of data.
- Industry: Industry in general also generates a lot of data. A single aircraft engine generates more than 10 TB of data every 30 minutes.
- Finance: Finance is also a major creator of data, as the New York Stock Exchange alone is capable of generating 1 TB of data per day.
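These figures use different units and time periods, which makes them hard to compare. The sketch below converts them to terabytes per day. It takes the numbers exactly as quoted above without verifying them, uses decimal (SI) units, and assumes the aircraft engine runs around the clock, which is a simplification. The Google figure is left out because no time period is given for it. The helper `tb_per_day` is illustrative.

```python
# Decimal (SI) units: 1 PB = 1,000 TB and 1 TB = 1,000 GB.
TB_PER_PB = 1_000
GB_PER_TB = 1_000

def tb_per_day(amount_tb: float, period_days: float) -> float:
    """Convert 'amount_tb every period_days' into terabytes per day."""
    return amount_tb / period_days

# Figures as quoted in the text (not independently verified).
rates = {
    "SKA telescope": tb_per_day(400 * TB_PER_PB, 365),   # 400 PB per year
    "Facebook and Twitter": tb_per_day(15, 1),           # 15 TB per day
    # 10 TB per 30 minutes, assuming the engine runs 24 hours a day.
    "Aircraft engine": tb_per_day(10, 30 / (24 * 60)),
    "New York Stock Exchange": tb_per_day(1, 1),         # 1 TB per day
}

# Print the sources from the largest daily volume to the smallest.
for source, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{source}: {rate:,.0f} TB/day")
```

On these assumptions, the telescope comes out at roughly 1,096 TB per day and a continuously running engine at 480 TB per day, far above the web and finance figures.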
Big Data Applications
That's a lot of data, but what can it be used for? Big Data can be extremely useful for modelling in many sectors. Discover here the difference between Big Data and Data Science, and the applications of these two major concepts.
- Health: The University of California, Los Angeles (UCLA) is using Big Data analysis to prevent complications from head trauma.
- Politics: Barack Obama's campaign hired a Big Data team to gain extremely precise knowledge of his voters.
- Sport: Big Data is used to analyse the performance of players across all the matches they have played and to spot talent.
- Finance: The sector uses Big Data to detect fraud and to predict share prices.
- Tech: The sector benefits enormously from Big Data, particularly in artificial intelligence, with Siri, autonomous cars, chatbots, etc.
At what cost?
The main limitation of Big Data remains data protection. As our analytical capacity grows, it becomes harder to draw the line between real utility and violation of individual privacy.
Users are not always aware of how their personal data is used. The GDPR may be Europe's answer to this problem, but will it really do the trick? If you want to find out more about the world of tech, check out the best Big Data courses.