Parallel Implementation of KNN Classifier



Experiment No: HPC-
TITLE: - Write a program for parallel implementation of K Nearest Neighbor Classifier.
Problem Statement: - Implementation of KNN classifier using MPI a parallel programming model.
Objective:
  • To study KNN classifier algorithm.
  • To study MPI a parallel programming model.
Requirements (Hw/Sw): PC, gcc, MPI tool, Ubuntu system.
Theory:-
What is KNN Algorithm?
K nearest neighbors or KNN Algorithm is a simple algorithm which uses the entire dataset in its training phase. Whenever a prediction is required for an unseen data instance, it searches through the entire training dataset for k-most similar instances and the data with the most similar instance is finally returned as the prediction.
kNN is often used in search applications where you are looking for similar items, like find items similar to this one.
Algorithm suggests that if you’re similar to your neighbours, then you are one of them. For example, if apple looks more similar to peach, pear, and cherry (fruits) than monkey, cat or a rat (animals), then most likely apple is a frui




How does a KNN Algorithm work?
The k-nearest neighbors algorithm uses a very simple approach to perform classification. When tested with a new example, it looks through the training data and finds the k training examples that are closest to the new example. It then assigns the most common class label (among those k-training examples) to the test example.
What does ‘k’ in kNN Algorithm represent?
k in kNN algorithm represents the number of nearest neighbor points which are voting for the new test data’s class.
If k=1, then test examples are given the same label as the closest example in the training set.
If k=3, the labels of the three closest classes are checked and the most common (i.e., occurring at least twice) label is assigned, and so on for larger ks.


  • In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:
  • In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
  • k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.
  • Both for classification and regression, it can be useful to assign weight to the contributions of the neighbours, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbour a weight of 1/d, where d is the distance to the neighbour.
  • The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. Thiscan be thought of as the training set for the algorithm, though no explicit training step is required.
  • A shortcoming of the k-NN algorithm is that it is sensitive to the local structureof the data. The algorithm has nothing to do with and is not to be confused with k-means, another popular machine learning technique.
Parameter selection:
  • The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classesless distinct. A good k can be selected by various heuristic techniques. The special case where the class is predicted to be the class of the closest training sample is called the nearest neighbor algorithm.
  • The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. A particularly popular [citation needed] approach is the use of evolutionary algorithms to optimize feature scaling. Another popular approach is to scale features by the mutual information of the training data with the training classes.
  • In binary (two class) classification problems, it is helpful to choose k to be an odd number as this avoids tied votes. One popular way of choosing the empirically
  • optimal k in this setting is via bootstrap method.


Example :
We have a data from questionary survey and objective testing with the two attribute
acid durability and strength to classify whether special tissue is good or bad where is
the
X1 = acid durability in second
X2 = strength in kg/m2
Y = classi_cation.
Now the factory produces a new paper tissue that pass laboratory test with X1 = 3
and X2 = 7. Without another expensive survey, can we guess what the classi_cation
of this new tissue is?




Solution :
1. Determine parameter K = number of nearest neighbours
Suppose use K = 3
2. Calculate the distance between the query-instance and all the training samples
Coordinate of query instance is (3, 7), instead of calculating the distance we compute
square distance which is faster to calculate (without square root)
3.Sort the distance and determine nearest neighbors based on the K-th minimum distance.
4. Gather the category of the nearest neighbors. /Notice in the second row last column that the category of nearest neighbor (Y) is not included because the rank of
this data is more than 3 (=K).
5. Use simple majority of the category of nearest neighbors as the prediction value
of the query instance
We have 2 good and 1 bad, since 2>1 then we conclude that a new paper tissue
that pass laboratory test with X1 = 3 and X2 = 7 is included in *Good *category.




Algorithm:
1. Determine parameters K= number of nearest neighbour.
2. Calculate the distance between the query-instance and all the training samples
3. Sort the distance and determine nearest neighbours based on the k-th minimum
distance
4. Gather the category Y of the nearest neighbours.
5. Use simply majority of the category of nearest neighbour of the prediction value
of query instance.
Message passing interface (MPI)
The message passing interface (MPI) is a standardized means of exchanging messages between multiple computers running a parallel program across distributed memory.




In parallel computing, multiple computers -- or even multiple processor cores within the same computer -- are called nodes.  Each node in the parallel arrangement typically works on a portion of the overall computing problem. The challenge then is to synchronize the actions of each parallel node, exchange data between nodes and provide command and control over the entire parallel cluster. The message passing interface defines a standard suite of functions for these tasks.




MPI is not endorsed as an official standard by any standards organization such as IEEE or ISO, but it is generally considered to be the industry standard and it forms the basis for most communication interfaces adopted by parallel computing programmers. The older MPI 1.3 standard (dubbed MPI-1) provides over 115 functions.




Message Passing:

  • A process is a program counter and address space.
  • Message passing is used for communication among processes.
  • Inter-process communication:
    • Type:
Synchronous / Asynchronous
    • Movement of data from one process’s address space to another’s
What is Message Passing?
  • Data transfer.
  • Requires cooperation of sender and receiver
  • Cooperation not always apparent in code




What IS MPI?
  • A message-passing library specifications:
      • Extended message-passing model
      • Not a language or compiler specification
      • Not a specific implementation or product
  • For parallel computers, clusters, and heterogeneous networks.
  • Communication modes: standard, synchronous, buffered, and ready.
  • Designed to permit the development of parallel software libraries.
  • Designed to provide access to advanced parallel hardware for
      • End users
      • Library writers
      • Tool developers
Group and Context
  • Are two important and indivisible concepts of MPI.
  • Group: is the set of processes that communicate with one another.
  • Context: it is somehow similar to the frequency in radio communications.
  • Communicator: is the central object for communication in MPI. Each communicator is associated with a group and a context




Communication Modes
  • Based on the type of send:
    • Synchronous: Completes once the acknowledgement is received by the sender.
    • Buffered send: completes immediately, unless if an error occurs.
    • Standard send: completes once the message has been sent, which may or may not imply that the message has arrived at its destination.
    • Ready send: completes immediately, if the receiver is ready for the message it will get it, otherwise the message is dropped silently.




Basic Commands & Simple MPI Functions
  • The Simple MPI (six functions that make most of programs work):
    • MPI_INIT:
    • MPI_FINALIZE
    • MPI_COMM_SIZE
    • MPI_COMM_RANK
    • MPI_SEND
    • MPI_RECV








Skeleton MPI Program
#include <mpi.h>
main( int argc, char** argv )
{
MPI_Init( &argc, &argv );
/* main part of the program */
/*
Use MPI function call depend on your data partitioning and the parallelization architecture
*/
MPI_Finalize();
}




Initializing MPI
  • The initialization routine MPI_INIT is the first MPI routine called.
  • MPI_INIT is called once
int mpi_Init( int *argc, char **argv );
MPI_Init(NULL, NULL);
Explanation of Basics
  • #include “mpi.h” provides basic MPI definitions and types.
  • MPI_Init starts MPI
  • MPI_Finalize exits MPI
  • Note that all non-MPI routines are local;
  • Note: MPI functions return error codes or MPI_SUCCESS
          • MPI_Comm_size : Number of processes working on problem
          • MPI_Comm_rank: ID of Process
      • Rank is with respect to a communicator (context of the communication).
      • MPI_COMM_WORLD is a predefined communicator that includes all processes (already mapped to processors).
  • MPI_Comm_size(MPI_COMM_WORLD, &sizeofgroup);
  • MPI_Comm_rank(MPI_COMM_WORLD, &rank);


Data Types
  • The data message which is sent or received is described by a triple (address, count, datatype).
  • The following data types are supported by MPI:
    • Predefined data types that are corresponding to data types from the programming language.
    • Arrays.
    • Sub blocks of a matrix
    • User defined data structure.
    • A set of predefined data types
MPI SEND
MPI_SEND(void *start, int count,MPI_DATATYPE datatype, int dest, int tag, MPI_COMM comm)
  • The message buffer is described by (start, count, datatype).
  • dest is the rank of the target process in the defined communicator.
  • tag is the message identification number.




MPI RECEIVE MPI_RECV(void *start, int count, MPI_DATATYPE datatype, int source, int tag, MPI_COMM comm, MPI_STATUS *status)
  • Source is the rank of the sender in the communicator.
  • The receiver can specify a wildcard value for souce (MPI_ANY_SOURCE) and/or a wildcard value for tag (MPI_ANY_TAG), indicating that any source and/or tag are acceptable
  • Status is used for exrtra information about the received message if a wildcard receive mode is used.
  • If the count of the message received is less than or equal to that described by the MPI receive command, then the message is successfully received. Else it is considered as a buffer overflow error.




  • A receive operation may accept messages from an arbitrary sender, but a send operation must specify a unique receiver.
  • Source equals destination is allowed, that is, a process can send a message to itself.




Compile and run the code
  • Compile using:
mpicc –o exec-filename programname.c
Or
mpic++ –o exec-filename programname.cpp
e.g. mpicc –o knn-mpi knn-mpi.c
Or
mpic++ –o knn-mpi knn-mpi.cpp
Note: Before compilation, sudo apt-get install libopenmpi-dev
  • Run using:
mpirun –np # of procs exec-filename
e.g. mpirun –np 20 knn-mpi
Note: Number of processors is equal to number of testing instances
Input :
Input in this should be example set i.e the trained data.
Output :
output will get generated based on the classi_cation of this nearest neighbor from
Exapmle set.





                                         MPI SAMPLE PROGRAM




 KNN MANUAL,INSTALLATION STEPS OF OpenMP ,MPI & program



Conclusion:

Comments

Popular Posts