Study & Implementation Using Haddop
- Get link
- X
- Other Apps
Experiment
No: DA-05
TITLE:
- Write Hadoop Program to count number of words in file.
Problem
Statement:
- Implementation
of Hadoop(Single Node) Map-Reduce technique to count numbers of words
in text file.
Objective:
-
To study fundamental concept of Hadoop & Map Reduce technique.
-
To study word count process using Hadoop framework.
-
To study HDFS
Requirements
(Hw/Sw): PC,
Jdk1.5 & above ,Hadoop, Eclipse
Theory:-
Hadoop
is an Apache open source framework written in java that allows
distributed processing of large datasets across clusters of computers
using simple programming models. A Hadoop frame-worked application
works in an environment that provides distributed storage and
computation across clusters of computers. Hadoop is designed to scale
up from single server to thousands of machines, each offering local
computation and storage.
MapReduce
Hadoop MapReduce is
a software framework for easily writing applications which process
big amounts of data in-parallel on large clusters (thousands of
nodes) of commodity hardware in a reliable, fault-tolerant manner.
The
term MapReduce actually refers to the following two different tasks
that Hadoop programs perform:
-
The Map Task: This is the first task, which takes input data and converts it into a set of data, where individual elements are broken down into tuples (key/value pairs).
-
The Reduce Task: This task takes the output from a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.
Typically
both the input and the output are stored in a file-system. The
framework takes care of scheduling tasks, monitoring them and
re-executes the failed tasks.
The
MapReduce framework consists of a single master JobTracker and
one slave TaskTracker per
cluster-node. The master is responsible for resource management,
tracking resource consumption/availability and scheduling the jobs
component tasks on the slaves, monitoring them and re-executing the
failed tasks. The slaves TaskTracker execute the tasks as directed by
the master and provide task-status information to the master
periodically.
The
JobTracker is a single point of failure for the Hadoop MapReduce
service which means if JobTracker goes down, all running jobs are
halted.
What
is MapReduce?
MapReduce
is a programming framework that allows us to perform distributed and
parallel processing on large data sets in a distributed environment.
-
MapReduce consists of two distinct tasks – Map and Reduce.
-
As the name MapReduce suggests, reducer phase takes place after mapper phase has been completed.
-
So, the first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate outputs.
-
The output of a Mapper or map job (key-value pairs) is input to the Reducer.
-
The reducer receives the key-value pair from multiple map jobs.
-
Then, the reducer aggregates those intermediate data tuples (intermediate key-value pair) into a smaller set of tuples or key-value pairs which is the final output.
MapReduce
Tutorial: A Word Count Example of MapReduce
Let
us understand, how a MapReduce works by taking an example where I
have a text file called example.txt whose contents are as
follows:
Dear,
Bear, River, Car, Car, River, Deer, Car and Bear
Now,
suppose, we have to perform a word count on the sample.txt using
MapReduce. So, we will be finding the unique words and the number of
occurrences of those unique words.
-
First, we divide the input in three splits as shown in the figure. This will distribute the work among all the map nodes.
-
Then, we tokenize the words in each of the mapper and give a hardcoded value (1) to each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in itself, will occur once.
-
Now, a list of key-value pair will be created where the key is nothing but the individual words and value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs – Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
-
After mapper phase, a partition process takes place where sorting and shuffling happens so that all the tuples with the same key are sent to the corresponding reducer.
-
So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1].., etc.
MapReduce
Tutorial: Advantages of MapReduce
The
two biggest advantages of MapReduce are:
1. Parallel
Processing:
In
MapReduce, we are dividing the job among multiple nodes and each node
works with a part of the job simultaneously. So, MapReduce is based
on Divide and Conquer paradigm which helps us to process the data
using different machines. As the data is processed by multiple
machine instead of a single machine in parallel, the time taken to
process the data gets reduced by a tremendous amount as shown in the
figure below (2).
2.
Data Locality:
Instead
of moving data to the processing unit, we are moving processing unit
to the data in the MapReduce Framework. In the traditional
system, we used to bring data to the processing unit and process it.
But, as the data grew and became very huge, bringing this huge amount
of data to the processing unit posed following issues:
-
Moving huge data to processing is costly and deteriorates the network performance.
-
Processing takes time as the data is processed by a single unit which becomes the bottleneck.
-
Master node can get over-burdened and may fail.
Now,
MapReduce allows us to overcome above issues by bringing the
processing unit to the data. So, as you can see in the above image
that the data is distributed among multiple nodes where each node
processes the part of the data residing on it. This allows us to have
the following advantages:
-
It is very cost effective to move processing unit to the data.
-
Every node gets a part of the data to process and therefore, there is no chance of a node getting overburdened.
Hadoop
Distributed File System
Hadoop
can work directly with any mountable distributed file system such as
Local FS, HFTP FS, S3 FS, and others, but the most common file system
used by Hadoop is the Hadoop Distributed File System (HDFS).
The
Hadoop Distributed File System (HDFS) is based on the Google File
System (GFS) and provides a distributed file system that is designed
to run on large clusters (thousands of computers) of small computer
machines in a reliable, fault-tolerant manner.
HDFS
uses a master/slave architecture where master consists of a
single NameNode that
manages the file system metadata and one or more slave DataNodes that
store the actual data.
A
file in an HDFS namespace is split into several blocks and those
blocks are stored in a set of DataNodes. The NameNode determines the
mapping of blocks to the DataNodes. The DataNodes takes care of read
and write operation with the file system. They also take care of
block creation, deletion and replication based on instruction given
by NameNode.
HDFS
provides a shell like any other file system and a list of commands
are available to interact with the file system. These shell commands
will be covered in a separate chapter along with appropriate
examples.
ADD
FIGURE OF HDFS ARCHITECTURE HERE.
Functions
of NameNode:
-
It is the master daemon that maintains and manages the DataNodes (slave nodes)
-
It records the metadata of all the files stored in the cluster, e.g. The location of blocks stored, the size of the files, permissions, hierarchy, etc. There are two files associated with the metadata:
-
FsImage: It contains the complete state of the file system namespace since the start of the NameNode.
-
-
It records each change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
-
It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live.
-
It keeps a record of all the blocks in HDFS and in which nodes these blocks are located.
-
In case of the DataNode failure, the NameNode chooses new DataNodes for new replicas, balance disk usage and manages the communication traffic to the DataNodes.
DataNode:
DataNodes
are the slave nodes in HDFS. Unlike NameNode, DataNode is a commodity
hardware, that is, a non-expensive system which is not of high
quality or high-availability. The DataNode is a block server that
stores the data in the local file ext3 or ext4.
Functions of DataNode:
-
These are slave daemons or process which runs on each slave machine.
-
The actual data is stored on DataNodes.
-
The DataNodes perform the low-level read and write requests from the file system’s clients.
Till
now, you must have realized that the NameNode is pretty much
important to us. If it fails, we are doomed. But don’t worry,
we will be talking about how Hadoop solved this single point of
failure problem in the next Apache Hadoop HDFS Architecture blog. So,
just relax for now and let’s take one step at a time.
Secondary
NameNode:
Apart
from these two daemons, there is a third daemon or a process
called Secondary NameNode. The Secondary NameNode works concurrently
with the primary NameNode as a helper
daemon. And
don’t be confused about the Secondary NameNode being a backup
NameNode because it is not.
Functions
of Secondary NameNode:
-
The Secondary NameNode is one which constantly reads all the file systems and metadata from the RAM of the NameNode and writes it into the hard disk or the file system.
-
It is responsible for combining the EditLogs with FsImage from the NameNode.
-
It downloads the EditLogs from the NameNode at regular intervals and applies to FsImage. The new FsImage is copied back to the NameNode, which is used whenever the NameNode is started the next time.
How
Does Hadoop Work?
Stage
1
A
user/application can submit a job to the Hadoop (a hadoop job client)
for required process by specifying the following items:
-
The location of the input and output files in the distributed file system.
-
The java classes in the form of jar file containing the implementation of map and reduce functions.
-
The job configuration by setting different parameters specific to the job.
Stage
2
The
Hadoop job client then submits the job (jar/executable etc) and
configuration to the JobTracker which then assumes the responsibility
of distributing the software/configuration to the slaves, scheduling
tasks and monitoring them, providing status and diagnostic
information to the job-client.
Stage
3
The
TaskTrackers on different nodes execute the task as per MapReduce
implementation and output of the reduce function is stored into the
output files on the file system.
Advantages
of Hadoop
-
Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatic distributes the data and work across the machines and in turn, utilizes the underlying parallelism of the CPU cores.
-
Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather Hadoop library itself has been designed to detect and handle failures at the application layer.
-
Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.
-
Another big advantage of Hadoop is that apart from being open source, it is compatible on all the platforms since it is Java based.
Sample
Algorithm For Map Reduce:
map(String
input_key, String input_value):
//
input_key: document name
//
input_value: document contents
for
each word w in input_value:
EmitIntermediate(w,
"1");
reduce(String
output_key, Iterator intermediate_values):
//
output_key: a word
//
output_values: a list of counts
int
result = 0;
for
each v in intermediate_values:
result
+= ParseInt(v);
Input:
Experiment No: DA-06
TITLE: - Write Hadoop Program to search & count of words in file.
Problem Statement: - Implementation of Hadoop(Single Node) Map-Reduce technique to search & count of words in file.
Conclusion:
apache hadoop
apache hadoop (software)
big data hadoop
hadoop
hadoop 2.7.3
hadoop ecosystem
hadoop hive
hadoop training
hadoop tutorial
hadoop tutorial for beginners
what is hadoop
- Get link
- X
- Other Apps
Comments
Post a Comment