Hadoop Tutorial: Part 3 - Replica Placement or Replication and Read Operations in HDFS



By now you should have some idea of Hadoop and HDFS. In Tutorial 1 and Tutorial 2 we covered an overview of Hadoop and HDFS. Let's get a bit more technical now and see how read operations are performed in HDFS, but before that we will look at what a replica of data (replication) is in Hadoop and how the namenode manages it.

After a few more tutorials, I will be posting some of the important questions and answers, most of which have been asked by our visitors and subscribers. These are also FAQs for interviews and certifications.

So now let's get down to business.

What is Replica or Replication of Data and What is Replication Factor?


As I told you while explaining the basics of Big Data, data is replicated more than once in an HDFS cluster. One of the main reasons behind this is fault tolerance. As we know, data is kept in HDFS by splitting it into chunks, or blocks. Every block of the data is replicated on more than one node, so even if one of the nodes holding a block goes down, we can still get it from a different node. So replication is nothing but keeping the same blocks of data on different nodes, and the replication factor is the number of copies we keep of every single block. For example, if I say my replication factor is 3 (the default in HDFS), it means every block of a file will be stored on 3 different machines.
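
In practice the replication factor is controlled by the dfs.replication property, and it can also be changed per file. Below is a minimal sketch using the Hadoop Java client; the path /user/hadoop/sample.txt is just an example, and it assumes a reachable HDFS cluster configured in core-site.xml/hdfs-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml from the classpath
        conf.set("dfs.replication", "3");              // default number of copies for files written by this client

        FileSystem fs = FileSystem.get(conf);          // DistributedFileSystem when fs.defaultFS points to HDFS

        // Ask the namenode to change the replication factor of an existing file to 3 copies
        // (example path; replace with a file that exists on your cluster)
        boolean requested = fs.setReplication(new Path("/user/hadoop/sample.txt"), (short) 3);
        System.out.println("Replication change accepted by namenode: " + requested);

        fs.close();
    }
}

The same thing can be done from the command line with hdfs dfs -setrep; the namenode then takes care of creating or deleting replicas in the background.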

Now, since the namenode is the master node, it is the namenode that decides on which datanodes the different replicas of the same block will be kept. So the next question is:

How is it decided on which datanodes the replicas of a block will be kept?

Well, there is a trade-off between reliability and read/write bandwidth here. Suppose we set the replication factor to 1: every block of the data is then kept only once, on a single machine, and if that machine goes down we cannot retrieve the data back. So we have a reliability problem. On the other hand, writing only a single copy incurs the lowest write bandwidth penalty, which is good.

Now let's say we set the replication factor to 5. In that case, even if one, two or three of the nodes holding a block go down, we can still get the data. Also, while reading, there is a higher chance of data locality, i.e. of a replica being on or near the node where the client runs, so data retrieval will be fast. But the write bandwidth cost is higher in this case, and so is the storage overhead from the extra redundancy.

So there is a trade-off between reliability and read/write bandwidth. In the vast majority of cases the replication factor is kept at three, which is suitable for most production scenarios.

Again Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.
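
You can actually see where the replicas of each block ended up by asking the namenode for the block locations. Here is a rough sketch with the FileSystem API (again, the file path is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hadoop/sample.txt");      // example path

        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block, each listing the datanodes that hold a replica
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (int i = 0; i < blocks.length; i++) {
            System.out.println("Block " + i
                    + "  datanodes: " + String.join(", ", blocks[i].getHosts())
                    + "  topology: " + String.join(", ", blocks[i].getTopologyPaths()));  // rack-aware paths
        }
        fs.close();
    }
}

With the default replication factor of 3 on a multi-rack cluster, you should typically see three datanodes per block, spread across two racks as described above.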

Overall, this strategy gives a good balance among reliability, write bandwidth, read performance and block distribution across the cluster. Now let us see how a read operation is performed in the Hadoop Distributed File System.

Read Operation in HDFS


As I told you in the last tutorial, HDFS has a master/slave kind of architecture: the namenode acts as the master and the datanodes as workers. All the metadata is held by the namenode, and the actual data is stored on the datanodes. Keeping this in mind, the figure below will give you some idea of how data flows between the client and HDFS, i.e. the namenode and the datanodes.
(Figure: anatomy of a file read from HDFS. Source: Hadoop: The Definitive Guide)
There are six steps involved in reading a file from HDFS. Let's suppose a client (an HDFS client) wants to read a file; the steps are listed below, followed by a short code sketch:

  • Step 1: First the client opens the file by calling the open() method on a FileSystem object, which for HDFS is an instance of the DistributedFileSystem class.
  • Step 2: DistributedFileSystem calls the namenode, using RPC, to determine the locations of the first few blocks of the file. For each block, the namenode returns the addresses of all the datanodes that have a copy of that block.
  • The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
  • Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks of the file, then connects to the first (closest) datanode for the first block in the file.
  • Step 4: Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.
  • Step 5: When the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block. This happens transparently to the client, which from its point of view is just reading a continuous stream.
  • Step 6: Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.
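
The whole sequence above boils down to just a few lines of client code. Here is a minimal sketch along the lines of the FileSystemCat example from Hadoop: The Definitive Guide; the hdfs:// URI below is an assumption, adjust it to your own cluster:

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:8020/user/hadoop/sample.txt";   // example URI

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);        // DistributedFileSystem for hdfs:// URIs

        InputStream in = null;
        try {
            in = fs.open(new Path(uri));                      // Steps 1-2: open() returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false);   // Steps 3-6: read() repeatedly until end of file
        } finally {
            IOUtils.closeStream(in);                          // Step 6: close() on the stream
        }
    }
}

Notice that the client code never talks about blocks or datanodes at all; the DFSInputStream hides all of that behind an ordinary stream.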

So what happens if DFSInputStream encounters an error while communicating with a datanode?

Well, if that happens, DFSInputStream will try to fetch the data from the next closest datanode holding that block (since DFSInputStream has the locations of all the datanodes where the block resides). It will also remember the datanode that failed, so that it does not needlessly retry it for later blocks. DFSInputStream also verifies checksums for the data transferred to it from the datanode. If a corrupted block is found, it is reported to the namenode before DFSInputStream attempts to read a replica of the block from another datanode.

One important aspect of this design is that the client contacts datanodes directly to retrieve data and is guided by the namenode to the best datanode for each block. This design allows HDFS to scale to a large number of concurrent clients because the data traffic is spread across all the datanodes in the cluster. Meanwhile, the namenode merely has to service block location requests (which it stores in memory, making them very efficient) and does not, for example, serve data, which would quickly become a bottleneck as the number of clients grew.

So in this tutorial I tried to help you understand replication and read operations in HDFS, and in the next tutorial I will explain write operations in HDFS in detail.

You can find most of this information in any of the books, but here I try to explain the same things in simpler words and in a sequential manner. Hope you guys like it.

Let me know in the comments section if you have any doubts about anything, and I will be really glad to answer your questions :)



If you like what you just read and want to continue learning about Big Data, you can subscribe to our email updates and like our Facebook page.





Comments

Anonymous said...

Really nice tutorial, clear and simple to understand ~

Anonymous said...

I don't understand your explanation about "replication factor": according to your talking, replication factor is the number of times we are going to replicate every single block of data. So replication factor is 3 that means all blocks of a file or data will be replicated for 3 times. How can I understand your explanation: all blocks of a file or data will be replicated on 3 different machines? Can this replicated data (copied 3 times) be restored in 1 machine? Could you please clarify this. I'm a bit confused about your explanation.

Anonymous said...

Very nice article. I was totally new to this topic. I didn't understand the replication well. What do rack and nodes stand for? Does a rack mean a set of machines belonging to a single switch? If not, why does it say keeping 2 blocks in a rack increases bandwidth? Thank you very much!
