
Big Data Explained

source link: https://medium.com/sanrusha-consultancy/big-data-storage-and-processing-explained-eb9847e6cdec

Big Data in simple words!

Photo by Sincerely Media on Unsplash

Big data is everywhere. Everything from Netflix to the digitization of simple manual forms has been made possible by big data. Big data has made data storage and processing not only faster but also cheaper and more affordable.

In this article, I will take you through what big data is and how it differs from the traditional approach to storing and processing data.

Traditional Approach

The traditional approach to data storage involves a single server with an expected storage limit. For example, you start with a server of 100 TB, and when you see that your data is about to grow beyond 100 TB, you add one more disk to the same machine. As the data keeps expanding and the storage capacity needs to increase, more and more disks are added to that same machine. It is still the same machine; you just keep adding more disks, and therefore more storage capacity, to it. This is called vertical expansion.

The image below shows the traditional and the big data approaches to data storage.

Traditional vs Big Data Storage (image by Author)

Big Data Approach

There is a limit to the traditional approach of vertical expansion: how long can you keep adding disks to the same machine? In big data, it works differently. Let’s say you start with a 100 TB machine and you need more storage. You add one more machine of, say, 40 TB, bringing the overall storage capacity to 140 TB. These machines work in tandem, like one machine. If you need more capacity, you add yet another machine: you keep adding new machines as and when you need more storage or more processing power.
That is big data!
In this case, the expansion is horizontal, and there is practically no limit to it, because you can keep adding more processing power and more storage simply by adding machines. That is one great advantage of big data: scalability.
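To make the contrast concrete, here is a minimal Python sketch (not Hadoop code) of a cluster whose total capacity grows simply by adding machines. The 100 TB and 40 TB figures come from the example above; everything else is made up for illustration.

```python
class Cluster:
    """A toy model of a horizontally scaled cluster."""

    def __init__(self):
        self.machines = []  # capacity of each machine, in TB

    def add_machine(self, capacity_tb):
        # Horizontal expansion: more capacity means another machine, not a bigger one.
        self.machines.append(capacity_tb)

    @property
    def total_capacity_tb(self):
        # The machines work in tandem, so the cluster behaves like one big machine.
        return sum(self.machines)


cluster = Cluster()
cluster.add_machine(100)          # start with a 100 TB machine
cluster.add_machine(40)           # need more space? add another machine
print(cluster.total_capacity_tb)  # 140, and you can keep adding machines
```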

One very important thing in big data is how files are stored. In the traditional approach, when we save a file, it is saved only once on the hard disk. For example, if I have three different files and I save them on my hard disk, each one is stored only once. So if that machine fails, I lose all the files.

In big data, you can configure the cluster to store each file more than once, on different machines. So, for example, if you have three files:

  • File 1 is stored on machine 1 as well as machine 4.
  • File 2 is stored on machine 1 as well as machine 3.
  • File 3 is stored on machine 2 as well as machine 4.

The image below shows how big data replicates files across more than one machine.

Big Data Storage and file replication (image by Author)

So in this case, if any machine goes down, you do not lose your data, because those files are also stored on other machines. For example, if machine 1 crashes, you still have file 1 and file 2 stored on different machines (machine 4 and machine 3, respectively), so you do not lose your files.
That is another big advantage of using big data!
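As a rough illustration, the following Python snippet models the replication layout from the list above and shows which copies survive when a machine crashes. The file and machine names are simply the ones used in the example, not real HDFS objects.

```python
# Where each file is replicated, following the example above.
replicas = {
    "file1": {"machine1", "machine4"},
    "file2": {"machine1", "machine3"},
    "file3": {"machine2", "machine4"},
}


def surviving_copies(failed_machine):
    """Return where each file can still be read after one machine fails."""
    return {name: machines - {failed_machine} for name, machines in replicas.items()}


# If machine1 crashes, every file still has at least one healthy copy, e.g.:
# {'file1': {'machine4'}, 'file2': {'machine3'}, 'file3': {'machine2', 'machine4'}}
print(surviving_copies("machine1"))
```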
But how is all this done?

These machines cannot just start talking to each other on their own, and a file does not get replicated by itself when you store it. You need some way of making the machines talk to each other, and some way of replicating files whenever somebody saves a file on big data.

And that is done by Hadoop: software for big data.

Hadoop

You can think of Hadoop as software, written in Java, that makes all these computers talk to each other. That means you install Hadoop on all these machines, and then they start talking to each other. The storage layer of Hadoop is called the Hadoop Distributed File System (HDFS).

In HDFS, one machine is designated as the Name node and all the other machines are Data nodes: the nodes where the data is stored.
The Name node keeps a log of which file is stored on which machine. That is very important, because if a machine fails, the Name node knows which files were stored on that machine and where those files are replicated, so it can re-replicate them onto other machines. That is how important the Name node is: without it, big data would not know which file is stored on which machine. Data nodes hold the data, and the Name node holds the information about where the data is stored.
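Here is a small, purely illustrative Python sketch of that bookkeeping (not the real NameNode): a map from blocks to the Data nodes holding them, used to work out what must be re-replicated when a node dies. The block and node names are invented for the example.

```python
# The Name node's "log": which Data nodes hold a copy of each block.
block_locations = {
    "file1_block0": {"datanode1", "datanode4"},
    "file2_block0": {"datanode1", "datanode3"},
    "file3_block0": {"datanode2", "datanode4"},
}


def blocks_to_rereplicate(failed_node):
    """Blocks that lost a copy when failed_node died, and where a surviving copy lives."""
    return {
        block: nodes - {failed_node}
        for block, nodes in block_locations.items()
        if failed_node in nodes
    }


# If datanode1 fails, these blocks are copied again from their surviving locations:
print(blocks_to_rereplicate("datanode1"))
```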

The image below explains the HDFS architecture.

Big Data Processing (image by Author)

It’s time to take a closer look and understand what is happening under the hood in HDFS. Let’s say you have a big file, 438 MB in size, and you are storing it in big data.

The image below explains what is going on in HDFS.

HDFS File store block size (image by Author)

When the admin configures big data, he or she has to specify the block size, and the default block size on big data is 128 MB. The block size can be configured as 256 MB or more than that, but usually it is at least 128 MB. To give you some context, on a normal Windows PC the file-system block size is only a few kilobytes.

Big data is good for storing large files.

Let’s say you have a file of size 438 MB and your big data cluster is configured with a 128 MB block size. This file will be split into four blocks: three blocks of 128 MB and a last block of 54 MB. And, as explained above, not all of the blocks will sit on one machine; each block is also copied to a different machine:

  • The first block will be on machine 1 and machine 4.
  • The second block will be on machine 2 and machine 3.
  • The third block will be on machine 2 and machine 4.
  • The last block will be on machine 1 and machine 3.

This is the redundancy HDFS creates, so that there is always a second copy of each block. And when it comes to retrieving the file, HDFS fetches all of these data blocks from the machines that hold them, combines them, and returns the file.
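The arithmetic behind this example can be sketched in a few lines of Python. This is plain arithmetic, not HDFS code; it simply assumes the 128 MB block size and 438 MB file used above.

```python
BLOCK_SIZE_MB = 128
FILE_SIZE_MB = 438


def split_into_blocks(file_size_mb, block_size_mb):
    """Return the size of each block the file is cut into, in order."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks


# Three full 128 MB blocks plus a final 54 MB block: [128, 128, 128, 54]
print(split_into_blocks(FILE_SIZE_MB, BLOCK_SIZE_MB))

# Reading the file back means fetching all four blocks, from whichever machines
# hold a copy, and concatenating them in order.
```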

How is data processed in big data?

Now consider this: you have many machines, and these machines have storage as well as CPUs. So while you are able to use the storage capacity of these machines through big data, you should also be able to use their processing power, that is, their CPUs.
And this is a big advantage!
Big data combines the processing power of these machines.

Consider the traditional approach, where you have only one machine and you have to keep adding CPUs to that machine. In big data, you have the combined processing power of all the machines connected to the cluster.
Just as storage is controlled by the Name node and Data nodes, data processing, meaning the use of the CPUs across the cluster, is controlled through the Job tracker and Task trackers.
The Task trackers are the ones actually doing the work, and the Job tracker hands out the work and keeps track of the health of the Task trackers.
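As a loose analogy, and not real Hadoop MapReduce code, the following Python sketch uses a local process pool to play the role of the Job tracker handing out small tasks to workers, much as Task trackers each process their own chunk of the data. The word-count task and the sample chunks are made up for illustration.

```python
from concurrent.futures import ProcessPoolExecutor


def count_words(text_chunk):
    """The 'task': the kind of work a Task tracker would run on its local chunk of data."""
    return len(text_chunk.split())


chunks = [
    "big data is everywhere",
    "hadoop makes these machines talk to each other",
]

if __name__ == "__main__":
    # The executor plays the Job tracker: it assigns one task per chunk and
    # collects the results. Real Hadoop does this across machines, not processes.
    with ProcessPoolExecutor() as job_tracker:
        counts = list(job_tracker.map(count_words, chunks))
    print(sum(counts))  # total word count across all chunks
```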

Conclusion

Big data, with its horizontal storage and combined processing power architecture, is a big leap forward in making data storage and processing not only faster but also more affordable.

Looking forward to your feedback!
