Parallel Implementation of KNN Classifier
- Get link
- X
- Other Apps
Experiment
No: HPC-
TITLE:
- Write a program for parallel implementation of K Nearest Neighbor
Classifier.
Objective:
-
To study KNN classifier algorithm.
-
To study MPI a parallel programming model.
Theory:-
What
is KNN Algorithm?
K
nearest neighbors or KNN Algorithm is a simple algorithm which uses
the entire dataset in its training phase. Whenever a prediction is
required for an unseen data instance, it searches through the entire
training dataset for k-most similar instances and the data with the
most similar instance is finally returned as the prediction.
kNN
is often used in search applications where you are
looking for similar items,
like find
items similar to this one.
Algorithm
suggests that if you’re
similar to your neighbours, then you are one of them.
For example, if apple looks more similar to peach, pear, and cherry
(fruits) than monkey, cat or a rat (animals), then most likely apple
is a frui
How
does a KNN Algorithm work?
The k-nearest
neighbors algorithm uses a very simple approach to perform
classification. When tested with a new example, it looks through the
training data and finds the k training examples that are
closest to the new example. It then assigns the most common class
label (among those k-training examples) to the test example.
What
does ‘k’ in kNN Algorithm represent?
k
in kNN algorithm represents the number of nearest neighbor points
which are voting for the new test data’s class.
If
k=1, then test examples are given the same label as the closest
example in the training set.
If
k=3, the labels of the three closest classes are checked and the most
common (i.e., occurring at least twice) label is assigned, and so on
for larger ks.
-
In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:
-
In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
-
k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.
-
Both for classification and regression, it can be useful to assign weight to the contributions of the neighbours, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbour a weight of 1/d, where d is the distance to the neighbour.
-
The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. Thiscan be thought of as the training set for the algorithm, though no explicit training step is required.
-
A shortcoming of the k-NN algorithm is that it is sensitive to the local structureof the data. The algorithm has nothing to do with and is not to be confused with k-means, another popular machine learning technique.
Parameter
selection:
-
The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classesless distinct. A good k can be selected by various heuristic techniques. The special case where the class is predicted to be the class of the closest training sample is called the nearest neighbor algorithm.
-
The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. A particularly popular [citation needed] approach is the use of evolutionary algorithms to optimize feature scaling. Another popular approach is to scale features by the mutual information of the training data with the training classes.
-
In binary (two class) classification problems, it is helpful to choose k to be an odd number as this avoids tied votes. One popular way of choosing the empirically
-
optimal k in this setting is via bootstrap method.
Example
:
We
have a data from questionary survey and objective testing with the
two attribute
acid
durability and strength to classify whether special tissue is good or
bad where is
the
X1
= acid durability in second
X2
= strength in kg/m2
Y
= classi_cation.
Now
the factory produces a new paper tissue that pass laboratory test
with X1 = 3
and
X2 = 7. Without another expensive survey, can we guess what the
classi_cation
of
this new tissue is?
Solution
:
1.
Determine parameter K = number of nearest neighbours
Suppose
use K = 3
2.
Calculate the distance between the query-instance and all the
training samples
Coordinate
of query instance is (3, 7), instead of calculating the distance we
compute
square
distance which is faster to calculate (without square root)
3.Sort
the distance and determine nearest neighbors based on the K-th
minimum distance.
4.
Gather the category of the nearest neighbors. /Notice in the second
row last column that the category of nearest neighbor (Y) is not
included because the rank of
this
data is more than 3 (=K).
5.
Use simple majority of the category of nearest neighbors as the
prediction value
of
the query instance
We
have 2 good and 1 bad, since 2>1 then we conclude that a new paper
tissue
that
pass laboratory test with X1 = 3 and X2 = 7 is included in *Good
*category.
Algorithm:
1.
Determine parameters K= number of nearest neighbour.
2.
Calculate the distance between the query-instance and all the
training samples
3.
Sort the distance and determine nearest neighbours based on the k-th
minimum
distance
4.
Gather the category Y of the nearest neighbours.
5.
Use simply majority of the category of nearest neighbour of the
prediction value
of
query instance.
Message
passing interface (MPI)
The
message passing interface (MPI) is a standardized means of exchanging
messages between multiple computers running a parallel program across
distributed memory.
In
parallel computing, multiple computers -- or even multiple processor
cores within the same computer -- are called nodes. Each node
in the parallel arrangement typically works on a portion of the
overall computing problem. The challenge then is to synchronize the
actions of each parallel node, exchange data between nodes and
provide command and control over the entire parallel cluster. The
message passing interface defines a standard suite of functions for
these tasks.
MPI
is not endorsed as an official standard by any standards organization
such as IEEE or ISO,
but it is generally considered to be the industry standard and it
forms the basis for most communication interfaces adopted by parallel
computing programmers. The older MPI 1.3 standard (dubbed MPI-1)
provides over 115 functions.
Message
Passing:
-
A process is a program counter and address space.
-
Message passing is used for communication among processes.
-
Inter-process communication:
-
Type:
-
Synchronous
/ Asynchronous
-
Movement of data from one process’s address space to another’s
What
is Message Passing?
-
Data transfer.
-
Requires cooperation of sender and receiver
-
Cooperation not always apparent in code
What
IS MPI?
-
A message-passing library specifications:
-
Extended message-passing model
-
Not a language or compiler specification
-
Not a specific implementation or product
-
-
For parallel computers, clusters, and heterogeneous networks.
-
Communication modes: standard, synchronous, buffered, and ready.
-
Designed to permit the development of parallel software libraries.
-
Designed to provide access to advanced parallel hardware for
-
End users
-
Library writers
-
Tool developers
-
Group
and Context
-
Are two important and indivisible concepts of MPI.
-
Group: is the set of processes that communicate with one another.
-
Context: it is somehow similar to the frequency in radio communications.
-
Communicator: is the central object for communication in MPI. Each communicator is associated with a group and a context
Communication
Modes
-
Based on the type of send:
-
Synchronous: Completes once the acknowledgement is received by the sender.
-
Buffered send: completes immediately, unless if an error occurs.
-
Standard send: completes once the message has been sent, which may or may not imply that the message has arrived at its destination.
-
Ready send: completes immediately, if the receiver is ready for the message it will get it, otherwise the message is dropped silently.
-
Basic
Commands & Simple MPI Functions
-
The Simple MPI (six functions that make most of programs work):
-
MPI_INIT:
-
MPI_FINALIZE
-
MPI_COMM_SIZE
-
MPI_COMM_RANK
-
MPI_SEND
-
MPI_RECV
-
Skeleton
MPI Program
#include
<mpi.h>
main(
int argc, char** argv )
{
MPI_Init(
&argc, &argv );
/* main part of the program */
/*
Use MPI function call depend on your data partitioning and the
parallelization architecture
*/
MPI_Finalize();
}
Initializing
MPI
-
MPI_INIT is called once
int
mpi_Init( int *argc, char **argv );
MPI_Init(NULL,
NULL);
Explanation
of Basics
-
#include “mpi.h” provides basic MPI definitions and types.
-
MPI_Init starts MPI
-
MPI_Finalize exits MPI
-
Note that all non-MPI routines are local;
-
Note: MPI functions return error codes or MPI_SUCCESS
-
MPI_Comm_size : Number of processes working on problem
-
MPI_Comm_rank: ID of Process
-
Rank is with respect to a communicator (context of the communication).
-
MPI_COMM_WORLD is a predefined communicator that includes all processes (already mapped to processors).
-
MPI_Comm_size(MPI_COMM_WORLD, &sizeofgroup);
-
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
Data
Types
-
The data message which is sent or received is described by a triple (address, count, datatype).
-
The following data types are supported by MPI:
-
Predefined data types that are corresponding to data types from the programming language.
-
Arrays.
-
Sub blocks of a matrix
-
User defined data structure.
-
A set of predefined data types
-
MPI
SEND
MPI_SEND(void
*start,
int count,MPI_DATATYPE
datatype,
int dest,
int tag,
MPI_COMM comm)
-
The message buffer is described by (start, count, datatype).
-
dest is the rank of the target process in the defined communicator.
-
tag is the message identification number.
MPI
RECEIVE
MPI_RECV(void *start,
int count,
MPI_DATATYPE datatype,
int source,
int tag,
MPI_COMM comm,
MPI_STATUS *status)
-
Source is the rank of the sender in the communicator.
-
The receiver can specify a wildcard value for souce (MPI_ANY_SOURCE) and/or a wildcard value for tag (MPI_ANY_TAG), indicating that any source and/or tag are acceptable
-
Status is used for exrtra information about the received message if a wildcard receive mode is used.
-
If the count of the message received is less than or equal to that described by the MPI receive command, then the message is successfully received. Else it is considered as a buffer overflow error.
-
A receive operation may accept messages from an arbitrary sender, but a send operation must specify a unique receiver.
-
Source equals destination is allowed, that is, a process can send a message to itself.
Compile
and run the code
-
Compile using:
mpicc –o exec-filename programname.c
Or
mpic++
–o exec-filename programname.cpp
e.g.
mpicc –o knn-mpi knn-mpi.c
Or
mpic++
–o knn-mpi knn-mpi.cpp
-
Run using:
mpirun
–np # of procs exec-filename
Input
:
Input
in this should be example set i.e the trained data.
Output
:
output
will get generated based on the classi_cation of this nearest
neighbor from
Exapmle
set.
Conclusion:
how knn works
k nearest neighbors
knn
knn classifier
MPI
OpenMp
what is knn algorithm
what is message passing interface
- Get link
- X
- Other Apps
Comments
Post a Comment