Big data are data sets that are rapidly generated and come from a variety of sources. Together, they create a huge array of data that can be used for analysis, forecasts, statistics, and decision-making.
The term “big data” appeared in 2008, when Nature editor Clifford Lynch noted that the amount of information in the world was growing too fast to process with traditional tools. Until 2011, big data was used mainly in science and statistics. Since 2014, the world’s leading universities and IT giants such as IBM, Google, and Microsoft have been systematically collecting and analyzing big data.
Big data is a huge volume of structured and unstructured information. The term also covers the technologies used to collect and process that data and put it to work.
Big data can include social media feeds, traffic sensors, satellite imagery, streaming audio and video, bank transactions, web and mobile app content, telemetry from cars and mobile devices, and financial market data.
Tech companies almost never delete collected information, because tomorrow it could be worth many times more than it is today. Even now it already generates billions in profit for many companies. The first versions of the Hadoop big data storage system did not even have a command to delete data: no such function was intended.
Take Facebook as an example. The company uses information about user behavior to recommend news and products within the social network. Knowing its audience increases users’ interest and motivates them to visit the social network as often as possible. As a consequence, Facebook’s profits grow.
And Google doesn’t just give search results based on keywords in a search query. It also takes into account the history of previous queries and the user’s interests.
The performance of computing systems has grown enormously in recent years, as graphs of transistor counts over the last 50 years show.
There is no clear criterion for how much data counts as “big”: the threshold depends on the era and on storage capacity. For example, 30 years ago a 10 MB hard drive was considered to hold a lot of data; in 2022, data sets of 100-150 GB and above qualify as big data.
Velocity describes the speed of data accumulation. Two factors determine it:
- The rate of accumulation from a single data source. For example, a social network records each time a user opens a page on their computer or in a smartphone app: such events can occur dozens of times a day. Production equipment, which transmits important indicators about its state, can generate data 10-100 times per second.
- The number of data sources. A social network has millions of users around the world; if we collect information on each user, the accumulation rate reaches millions of records per second. By contrast, a single plant may have only a few dozen machines, so its final accumulation rate will be up to a thousand records per second.
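As a rough sketch, the combined accumulation rate is simply the number of sources multiplied by the per-source rate. All the numbers below are illustrative assumptions, not real measurements:

```python
def accumulation_rate(sources: int, records_per_source_per_sec: float) -> float:
    """Total records per second = number of sources x per-source rate."""
    return sources * records_per_source_per_sec

# A social network: billions of users, each generating ~50 events a day
# (page opens, clicks). All figures are assumed for illustration.
users = 2_000_000_000
social_rate = accumulation_rate(users, 50 / 86_400)  # 50 events/day, in seconds

# A plant: a few dozen machines, each reporting ~30 readings per second.
machines = 30
machine_rate = accumulation_rate(machines, 30)

print(f"social network: ~{social_rate:,.0f} records/sec")  # over a million
print(f"factory floor:  ~{machine_rate:,.0f} records/sec")  # hundreds
```

The same formula explains why a handful of high-frequency machines and millions of low-frequency users can end up orders of magnitude apart.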
Variety means data can differ both in content and in structure: structured, weakly structured, and unstructured.
To build a big data management and data analytics system, you need to understand what types of data are used:
- Structured data is strictly organized: for example, everyone who works in Excel works with structured data, where each value sits in a defined row and column.
- Weakly structured data is what Internet data is usually called: information obtained from social networks, or browsing history. JSON and XML are weakly structured formats. JSON is used more often because of its simplicity, but more complex data structures can be built on top of XML.
- Unstructured data has arbitrary form and no predetermined shape: for example, a set of files, each unique in itself. Their storage must nevertheless be organized somehow.
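To make the JSON/XML contrast concrete, here is the same made-up event parsed from both formats with Python's standard library:

```python
import json
import xml.etree.ElementTree as ET

# The same weakly structured record encoded two ways (illustrative data).
json_doc = '{"user": "alice", "action": "page_view", "count": 3}'
xml_doc = "<event><user>alice</user><action>page_view</action><count>3</count></event>"

# JSON maps directly onto dicts and lists, which is why it is the simpler choice.
event = json.loads(json_doc)
print(event["user"], event["count"])  # alice 3

# XML requires tree navigation, but supports attributes, namespaces, and
# schemas, which allow the more complex structures mentioned above.
root = ET.fromstring(xml_doc)
print(root.find("user").text, int(root.find("count").text))  # alice 3
```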
Veracity means the data are “correct” and consistent: they can be trusted, analyzed, and used to make business decisions.
High veracity requirements are usually imposed by financial institutions: one wrong number recorded in a database can lead to incorrect reports.
But there are situations where veracity is less important. When the data accumulation rate exceeds a thousand records per second, one or even ten erroneous entries will not cause a problem: they will be followed by another 990 good-quality records.
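A minimal sketch of that tolerance: generate a stream of made-up sensor readings, inject a handful of bad ones, and filter them out before analysis. The records and the validity rule are invented for illustration:

```python
import random

random.seed(1)  # reproducible illustration
# 1,000 temperature readings from a sensor (made-up data).
records = [{"temp": random.uniform(20.0, 90.0)} for _ in range(1000)]
for i in range(10):  # inject 10 impossible readings (sensor glitches)
    records[i]["temp"] = -999.0

# Veracity check: keep only physically plausible values.
valid = [r for r in records if 0.0 <= r["temp"] <= 150.0]
share_bad = 1 - len(valid) / len(records)
print(f"kept {len(valid)} of {len(records)} records ({share_bad:.1%} discarded)")
```

A 1% loss is negligible for trend analysis, while for a bank ledger the same check would have to reject the whole batch instead.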
Data streams can vary for many reasons: social phenomena, the seasons, external influences. Even when data is collected on something as steady as the temperature of production equipment or a computing server, the readings change constantly if the temperature is measured accurately enough.
Variability also applies to the frequency with which data arrives: sometimes 1,000 records per second, sometimes only 100. For example, counts of active app users vary from day to day, because users do not open the app every day.
Value is the factor that gives meaning to all of the characteristics described above. It depends on the organization’s own ability to extract value from the data and to turn that knowledge into value for customers.
Social data comes from social networks, websites, apps, and services integrated with social networks. It includes the history of visits to social networks and messengers, reactions to messages and news, and any other user actions.
Machine data is what equipment produces about itself: location, internal state (e.g. temperature), and other metrics. Here “equipment” means any wearable device, smart-home element, or piece of production machinery in a factory.
Transactional data covers banking and any other financial transactions. With the emergence of neobanks and fintech startups, the amount of transactional data in the world has grown dramatically.
Big data is stored in data centers with powerful servers. Modern computing systems provide instant access to all data.
Distributed storage systems are used to handle big data: often the full data set does not fit on one server and must be spread across several.
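A common way to decide which server stores which record is hash-based sharding: a stable hash of the record's key picks the server. The server names below are hypothetical:

```python
import hashlib

SERVERS = ["node-1", "node-2", "node-3"]  # hypothetical storage servers

def shard_for(key: str) -> str:
    """Pick a server for a key; md5 is stable across runs, unlike hash()."""
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return SERVERS[digest % len(SERVERS)]

for user_id in ["alice", "bob", "carol", "dave"]:
    print(user_id, "->", shard_for(user_id))
```

Real systems usually prefer consistent hashing, so that adding or removing a server moves only a fraction of the keys; the modulo scheme above is the simplest starting point.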
Distributing the data helps to process the information faster. This is possible because each part of the data is handled by a separate server and the processing runs in parallel.
There are distributed computing systems that can work with data sets larger than a petabyte, for example Apache Spark and its predecessor, Hadoop MapReduce.
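The map-reduce model behind such systems can be sketched in miniature: map each chunk of data independently (the part that parallelizes across servers), then reduce the partial results into one answer. The word-count task and the chunks below are made up for illustration:

```python
from collections import Counter

def map_count(chunk: str) -> Counter:
    """Map step: count words in one chunk, independently of the others."""
    return Counter(chunk.split())

def reduce_counts(partials: list[Counter]) -> Counter:
    """Reduce step: merge the per-chunk counts into a single result."""
    total = Counter()
    for part in partials:
        total += part
    return total

# Each chunk could live on a different server and be mapped in parallel.
chunks = ["big data big", "data is big", "spark and mapreduce"]
result = reduce_counts([map_count(c) for c in chunks])
print(result.most_common(2))  # [('big', 3), ('data', 2)]
```

At cluster scale the map step really does run on many machines at once; Spark also keeps intermediate results in memory, which is a key reason it typically outperforms disk-based Hadoop MapReduce.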