Hadoop Tutorial: Part 2 - Hadoop Distributed File System (HDFS)

In the last tutorial on  What is Hadoop?  i have given you a brief idea about Hadoop. So the two integral parts of Hadoop is Hadoop HDFS and Hadoop MapReduce.

Lets go further deep inside HDFS.

Hadoop Distributed File System (HDFS)  Concepts:

First take a look at the following two terminologies that will be used while describing HDFS.

Cluster: A hadoop cluster is made by having many machines in a network, each machine is termed as a node, and these nodes talks to each other over the network.

    Block Size: This is the minimum amount of size of one block in a filesystem, in which data can be kept contiguously. The default size of a single block in HDFS is 64 Mb.

      In HDFS, Data is kept by splitting it into small chunks or parts. Lets say you have a text file of 200 MB and you want to keep this file in a Hadoop Cluster. Then what happens is that, the file breaks or splits into a large number of chunks, where each chunk is equal to the block size that is set for the HDFS cluster (which is 64 MB by default). Hence a 200 Mb of file gets split into 4 parts, 3 parts of 64 mb and 1 part of 8 mb, and each part will be kept on a different machine. On which machine which split will be kept is decided by Namenode, about which we will be discussing in details below.

      Now in a Hadoop Distributed File System or HDFS Cluster, there are two kinds of nodes, A Master Node and many Worker Nodes. These are known as:

      Namenode (master node) and Datanode (worker node).


      The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. So it contains the information of all the files, directories and their hierarchy in the cluster in the form of a Namespace Image and edit logs. Along with the filesystem information it also knows about the Datanode on which  all the blocks of a file is kept.

      A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes. The client presents a filesystem interface similar to a Portable Operating System Interface (POSIX), so the user code does not need to know about the namenode and datanode to function.


      These are the workers that does the real work. And here by real work we mean that the storage of actual data is done by the data node. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.

      Here one important thing that is there to note: In one cluster there will be only one Namenode and there can be N number of datanodes.

      Since the Namenode contains the metadata of all the files and directories and also knows about the datanode on which each split of files are stored. So lets say Namenode goes down then what do you think will happen?.

      Yes, if the Namenode is Down we cannot access any of the files and directories in the cluster. 
      Even we will not be able to connect with any of the datanodes to get any of the files. Now think of it, since we have kept our files by splitting it in different chunks and also we have kept them in different datanodes. And it is the Namenode that keeps track of all the files metadata. So only Namenode knows how to reconstruct a file back into one from all the splits. and this is the reason that if Namenode is down in a hadoop cluster so every thing is down. 
      This is also the reason that's why Hadoop is known as a Single Point of failure.

      Now since Namenode is so important, we have to make the namenode resilient to failure. And for that hadoop provides us with two mechanism.

      The first way is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as a remote NFS mount.

      The second way is running a Secondary Namenode. Well as the name suggests, it does not act like a Namenode. So if it doesn't act like a namenode how does it prevents from the failure.

      Well the Secondary namenode also contains a namespace image and edit logs like namenode. Now after every certain interval of time(which is one hour by default)  it copies the namespace image from namenode and merge this namespace image with the edit log and copy it back to the namenode so that namenode will have the fresh copy of namespace image. Now lets suppose at any instance of time the namenode goes down and becomes corrupt then we can restart  some other machine with the namespace image and the edit log that's what we have with the secondary namenode and hence can be prevented from a total failure.
      Secondary Name node takes almost the same amount of memory and CPU for its working as the Namenode. So it is also kept in a separate machine like that of a namenode. Hence we see here that in a single cluster we have one Namenode, one Secondary namenode and many Datanodes, and HDFS consists of these three elements.

      This was again an overview of Hadoop Distributed File System HDFS, In the next part of the tutorial we will know about the working of Namenode and Datanode in a more detailed manner.We will know how read and write happens in HDFS.

      Let me know if you have any doubts in understanding anything into the comment section and i will be really glad to answer your questions :)

      If you like what you just read and want to continue your learning on BIGDATA you can subscribe to our Email and Like our facebook page

      Find Comments below or Add one

      vishwash said...

      very informative...

      Unknown said...

      Thanks for such a informatic tutorials :)
      please keep posting .. waiting for more... :)

      Anonymous said...

      Nice information........But I have one doubt like, what is the advantage of keeping the file in part of chunks on different-2 datanodes? What kind of benefit we are getting here?

      Unknown said...

      @Anonymous: Well there are lots of reasons... i will explain that with great details in the next few articles...
      But for now let us understand this... since we have split the file into two, now we can take the power of two processors(parallel processing) on two different nodes to do our analysis(like search, calculation, prediction and lots more).. Again lets say my file size is in some petabytes... Your won't find one Hard disk that big.. and lets say if it is there... how do you think that we are going to read and write on that hard disk(the latency will be really high to read and write)... it will take lots of time...Again there are more reasons for the same... I will make you understand this in more technical ways in the coming tutorials... Till then keep reading :)

      Post a comment