Hadoop Tutorial: Part 1 - What is Hadoop ? (an Overview)

Hadoop is an open source software framework that supports data intensive distributed applications which is licensed under Apache v2 license.

At-least this is what you are going to find as the first line of definition on Hadoop in Wikipedia. So what is data intensive distributed applications?

Well data intensive is nothing but BigData (data that has outgrown in size) and distributed applications are the applications that works on network by communicating and  coordinating with each other by passing messages. (say using a RPC interprocess communication or through Message-Queue)

Hence Hadoop works on a distributed environment and is build to store, handle and process large amount of data set (in petabytes, exabyte and more). Now here since i am saying that hadoop stores petabytes of data, this doesn't mean that Hadoop is a database. Again remember its a framework that handles large amount of data for processing. You will get to know the difference between Hadoop and Databases (or NoSQL Databases, well that's what we call BigData's databases) as you go down the line in the coming tutorials.

Hadoop was derived from the research paper published by Google on Google File System(GFS) and Google's MapReduce. So there are two integral parts of Hadoop: Hadoop Distributed File System(HDFS) and Hadoop MapReduce.

Hadoop Distributed File System (HDFS)

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
Well Lets get into the details of the statement mentioned above:

Very Large files: Now when we say very large files we mean here that the size of the file will be in a range of gigabyte, terabyte, petabyte or may be more.

Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

Commodity Hardware: Hadoop doesn't require expensive, highly reliable hardware. It’s designed to run
on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.

Now here we are talking about a FileSystem, Hadoop Distributed FileSystem. And we all know about a few of the other File Systems like Linux FileSystem and Windows FileSystem. So the next question comes is...

What is the difference between normal FileSystem and Hadoop Distributed File System?

The major two differences that is notable between HDFS and other Filesystems are:

  • Block Size: Every disk is made up of a block size. And this is the minimum amount of data that is written and read from a Disk. Now a Filesystem also consists of blocks which is made out of these blocks on the disk. Normally disk blocks are of 512 bytes and those of filesystem are of a few kilobytes.  In case of HDFS we also have the blocks concept. But here one block size is of 64 MB by default and which can be increased in an integral multiple of 64 i.e. 128MB, 256MB, 512MB or even more in GB's. It all depend on the requirement and use-cases. 
          So Why are these blocks size so large for HDFS? keep on reading and you will get it in a next few tutorials :)
  • Metadata Storage: In normal file system there is a hierarchical storage of metadata i.e. lets say there is a folder ABC, inside that folder there is again one another folder DEFand inside that there is hello.txt file. Now the information about hello.txt (i.e. metadata info of hello.txt) file will be with DEF and again the metadata of DEF will be with ABC. Hence this forms a hierarchy and this hierarchy is maintained until the root of the filesystem. But in HDFS we don't have a hierarchy of metadata. All the metadata information resides with a single machine known as Namenode (or Master Node) on the cluster. And this node contains all the information about other files and folder and lots of other information too, which we will learn in the next few tutorials. :) 
Well this was just an overview of Hadoop and Hadoop Distributed File System. Now in the next part i will go into the depth of HDFS and there after MapReduce and will continue from here...

Let me know if you have any doubts in understanding anything into the comment section and i will be really glad to answer the same :)

If you like what you just read and want to continue your learning on BIGDATA you can subscribe to our Email and Like our facebook page

Find Comments below or Add one

Romain Rigaux said...

Nice summary!

pragya khare said...

I know i'm a beginner and this question myt be a silly 1....but can you please explain to me that how PARALLELISM is achieved via map-reduce at the processor level ??? if I've a dual core processor, is it that only 2 jobs will run at a time in parallel?

Anonymous said...

Hi I am from Mainframe background and with little knowledge of core java...Do you think Java is needed for learning Hadoop in addition to Hive/PIG ? Even want to learn Java for map reduce but couldn't find what all will be used in realtime..and definitive guide books seems tough for learning mapreduce with Java..any option where I can learn it step by step?
Sorry for long comment..but it would be helpful if you can guide me..

Deepak Kumar said...

@Pragya Khare...
First thing always remember... the one Popular saying.... NO Questions are Foolish :) And btw it is a very good question.
Actually there are two things:

One is what will be the best practice? and other is what happens in there by default ?...

Well by default the number of mapper and reducer is set to 2 for any task tracker, hence one sees a maximum of 2 maps and 2 reduces at a given instance on a TaskTracker (which is configurable)..Well this Doesn't only depend on the Processor but on lots of other factor as well like ram, cpu, power, disk and others....


And for the other factor i.e for Best Practices it depends on your use case. You can go through the 3rd point of the below link to understand it more conceptually


Well i will explain all these when i will reach the advance MapReduce tutorials.. Till then keep reading !! :)

Deepak Kumar said...

As Hadoop is written in Java, so most of its API's are written in core Java... Well to know about the Hadoop architecture you don't need Java... But to go to its API Level and start programming in MapReduce you need to know Core Java.

And as for the requirement in java you have asked for... you just need simple core java concepts and programming for Hadoop and MapReduce..And Hive/PIG are the SQL kind of data flow languages that is really easy to learn...And since you are from a programming background it won't be very difficult to learn java :) you can also go through the link below for further details :)


Post a Comment