Experiment No: DA-05
TITLE: - Write a Hadoop program to count the number of words in a file.
Problem Statement: - Implementation of Hadoop (Single Node) MapReduce technique to count the number of words in a text file.
Objective:
Requirements (Hw/Sw): PC, JDK 1.5 & above, Hadoop, Eclipse
Theory:-
Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. An application built on the Hadoop framework works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process large amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that Hadoop programs perform:
- The Map Task: This is the first task, which takes input data and converts it into a set of data where individual elements are broken down into tuples (key/value pairs).
- The Reduce Task: This task takes the output from a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task. (A minimal Java skeleton of both tasks is sketched after this list.)
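In Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce), these two tasks correspond to subclasses of Mapper and Reducer. A minimal skeleton is sketched below; the class names and the key/value types are illustrative, chosen for the word-count case, and not prescribed by the experiment.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: converts each input record into intermediate (key, value) tuples.
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // break the input value into elements and emit them with context.write(...)
    }
}

// Reduce task: combines all values sharing a key into a smaller set of tuples.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // aggregate the values for this key and emit the result with context.write(...)
    }
}

The four generic parameters of each class are the input key/value and output key/value types of that task.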
Typically both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for resource management, tracking resource consumption/availability and scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slave TaskTrackers execute the tasks as directed by the master and provide task-status information to the master periodically.
The JobTracker is a single point of failure for the Hadoop MapReduce service, which means that if the JobTracker goes down, all running jobs are halted.
What is MapReduce?
MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment.
- MapReduce consists of two distinct tasks – Map and Reduce.
- As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
- So, the first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate outputs.
- The output of a Mapper or map job (key-value pairs) is input to the Reducer.
- The reducer receives the key-value pairs from multiple map jobs.
- Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.
MapReduce Tutorial: A Word Count Example of MapReduce
Let us understand how MapReduce works by taking an example where I have a text file called example.txt whose contents are as follows:
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
Now, suppose we have to perform a word count on example.txt using MapReduce. So, we will be finding the unique words and the number of occurrences of those unique words.
- First, we divide the input into three splits as shown in the figure. This will distribute the work among all the map nodes.
- Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in itself, will occur once.
- Now, a list of key-value pairs will be created where the key is nothing but the individual words and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs – Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
- After the mapper phase, a partition process takes place where sorting and shuffling happen so that all the tuples with the same key are sent to the corresponding reducer.
- So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1], etc.
- Now, each Reducer counts the values which are present in that list of values. As shown in the figure, the reducer gets a list of values which is [1,1] for the key Bear. Then, it counts the number of ones in the list and gives the final output as – Bear, 2.
- Finally, all the output key/value pairs are collected and written to the output file. (A mapper and reducer for this flow are sketched in Java after this list.)
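The following is a minimal sketch of this word-count flow using Hadoop's Java MapReduce API; the class names (TokenizerMapper, IntSumReducer) are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: tokenizes each input line and emits (word, 1) for every token.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);        // e.g. (Dear, 1), (Bear, 1), (River, 1)
        }
    }
}

// Reducer: receives (word, [1, 1, ...]) after shuffle/sort and sums the ones.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);          // e.g. (Bear, 2)
    }
}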
MapReduce Tutorial: Advantages of MapReduce
The two biggest advantages of MapReduce are:
1. Parallel Processing:
In MapReduce, we are dividing the job among multiple nodes and each node works with a part of the job simultaneously. So, MapReduce is based on the Divide and Conquer paradigm, which helps us to process the data using different machines. As the data is processed by multiple machines in parallel instead of a single machine, the time taken to process the data gets reduced by a tremendous amount, as shown in the figure below.
Fig.: Traditional Way Vs. MapReduce Way – MapReduce Tutorial
2. Data Locality:
Instead of moving data to the processing unit, we move the processing unit to the data in the MapReduce framework. In the traditional system, we used to bring the data to the processing unit and process it. But as the data grew and became very huge, bringing this huge amount of data to the processing unit posed the following issues:
- Moving huge data to the processing unit is costly and deteriorates the network performance.
- Processing takes time as the data is processed by a single unit, which becomes the bottleneck.
- The master node can get over-burdened and may fail.
Now, MapReduce allows us to overcome the above issues by bringing the processing unit to the data. So, as you can see in the above image, the data is distributed among multiple nodes where each node processes the part of the data residing on it. This allows us to have the following advantages:
- It is very cost-effective to move the processing unit to the data.
- The processing time is reduced as all the nodes are working with their part of the data in parallel.
- Every node gets a part of the data to process and therefore, there is no chance of a node getting overburdened.
Hadoop Distributed File System
Hadoop can work directly with any mountable distributed file system such as Local FS, HFTP FS, S3 FS, and others, but the most common file system used by Hadoop is the Hadoop Distributed File System (HDFS).
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on large clusters (thousands of computers) of small commodity machines in a reliable, fault-tolerant manner.
HDFS uses a master/slave architecture where the master consists of a single NameNode that manages the file system metadata, and one or more slave DataNodes store the actual data.
A file in an HDFS namespace is split into several blocks and those blocks are stored in a set of DataNodes. The NameNode determines the mapping of blocks to the DataNodes. The DataNodes take care of read and write operations with the file system. They also take care of block creation, deletion and replication based on instructions given by the NameNode.
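From a client program, this block-to-DataNode mapping is transparent: the client simply opens a path through the HDFS FileSystem API. The following is a minimal read sketch; the NameNode URI and the file path are illustrative, assuming a single-node setup.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "hdfs://localhost:9000" is an illustrative single-node NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // The client asks the NameNode for block locations and then streams the
        // blocks from the DataNodes that hold them.
        try (FSDataInputStream in = fs.open(new Path("/input/example.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}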
HDFS provides a shell like any other file system, and a list of commands is available to interact with the file system; a few common commands are shown below.
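For example (the file and directory names here are illustrative):

hadoop fs -mkdir /input                  # create a directory in HDFS
hadoop fs -put example.txt /input        # copy a local file into HDFS
hadoop fs -ls /input                     # list the directory contents
hadoop fs -cat /input/example.txt        # print the file contents
hadoop fs -get /input/example.txt .      # copy the file back to the local disk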
Fig.: HDFS Architecture
Functions of NameNode:
- It is the master daemon that maintains and manages the DataNodes (slave nodes).
- It records the metadata of all the files stored in the cluster, e.g. the location of blocks stored, the size of the files, permissions, hierarchy, etc. There are two files associated with the metadata: the FsImage and the EditLog.
- It records each change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
- It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live.
- It keeps a record of all the blocks in HDFS and in which nodes these blocks are located.
- The NameNode is also responsible for taking care of the replication factor of all the blocks, which we will discuss in detail later in this HDFS tutorial blog.
- In case of DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage and manages the communication traffic to the DataNodes.
DataNode:
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity hardware, that is, a non-expensive system which is not of high quality or high availability. The DataNode is a block server that stores the data in the local file system (ext3 or ext4).
Functions of DataNode:
- These are slave daemons or processes which run on each slave machine.
- The actual data is stored on DataNodes.
- The DataNodes perform the low-level read and write requests from the file system's clients.
- They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds. (A sample configuration fragment is sketched after this list.)
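On a single-node installation, behaviours such as the heartbeat interval and the replication factor are driven by configuration. A minimal hdfs-site.xml fragment is sketched below; the values shown are assumptions for a single-node setup, not required settings.

<!-- hdfs-site.xml fragment (illustrative) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- 1 is typical for a single-node cluster; the usual default is 3 -->
    <value>1</value>
  </property>
  <property>
    <name>dfs.heartbeat.interval</name>
    <!-- DataNode heartbeat interval in seconds -->
    <value>3</value>
  </property>
</configuration>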
Till now, you must have realized that the NameNode is pretty much important to us. If it fails, we are doomed. But don't worry, we will be talking about how Hadoop solved this single point of failure problem in the next Apache Hadoop HDFS Architecture blog. So, just relax for now and let's take one step at a time.
Secondary NameNode:
Apart from these two daemons, there is a third daemon or process called the Secondary NameNode. The Secondary NameNode works concurrently with the primary NameNode as a helper daemon. And don't be confused about the Secondary NameNode being a backup NameNode, because it is not.
Functions of Secondary NameNode:
- The Secondary NameNode is the one which constantly reads all the file systems and metadata from the RAM of the NameNode and writes it into the hard disk or the file system.
- It is responsible for combining the EditLogs with the FsImage from the NameNode.
- It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage. The new FsImage is copied back to the NameNode, which is used whenever the NameNode is started the next time.
How Does Hadoop Work?
Stage 1
A user/application can submit a job to Hadoop (via a Hadoop job client) for the required process by specifying the following items:
- The location of the input and output files in the distributed file system.
- The Java classes, in the form of a jar file, containing the implementation of the map and reduce functions.
- The job configuration, set through different parameters specific to the job. (A minimal driver sketch covering these three items follows this list.)
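The following is a minimal driver sketch that specifies exactly these three items, assuming a recent Hadoop release, the TokenizerMapper and IntSumReducer classes sketched in the word-count example above (typically declared as static nested classes of the driver), and input/output paths passed on the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();                 // job configuration (item 3)
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);                       // jar with map/reduce classes (item 2)
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input location (item 1)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output location (item 1)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}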
Stage 2
The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.
Stage 3
The TaskTrackers on different nodes execute the tasks as per the MapReduce implementation, and the output of the reduce function is stored in the output files on the file system.
Advantages of Hadoop
- The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.
- Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.
- Servers can be added to or removed from the cluster dynamically and Hadoop continues to operate without interruption.
- Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java based.
Sample Algorithm For Map Reduce:
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
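As a usage sketch (the file, jar and path names are illustrative), a Java word-count program such as the one sketched earlier is typically compiled against the Hadoop classpath, packaged as a jar, and submitted with the hadoop command:

javac -classpath "$(hadoop classpath)" WordCount.java    # compile against Hadoop's libraries
jar cf wc.jar WordCount*.class                           # package the classes into a jar
hadoop fs -mkdir -p /input
hadoop fs -put example.txt /input                        # stage the input file in HDFS
hadoop jar wc.jar WordCount /input /output               # submit the job
hadoop fs -cat /output/part-r-00000                      # inspect the result, e.g. "Bear 2"

The output directory must not exist before the job runs; Hadoop creates it and writes one part-r-NNNNN file per reducer.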
Input:
Experiment No: DA-06
TITLE: - Write a Hadoop program to search for and count words in a file.
Problem Statement: - Implementation of Hadoop (Single Node) MapReduce technique to search for and count words in a file.
Conclusion: