Study the R Programming Platform: Download the Pima Indians Diabetes Dataset or the Titanic Dataset and Use the Naive Bayes Algorithm for Classification


Experiment No: DA-(3,4)
TITLE: Download the Pima Indians Diabetes dataset and use the Naive Bayes algorithm for classification.

OR

The Titanic dataset in R is a table of about 2200 passengers summarized by four factors: economic status (1st class, 2nd class, 3rd class or crew), sex (male or female), age category (child or adult), and whether the passenger survived.

·        Load the data from a CSV file and split it into training and test datasets.
·        Summarize the properties of the training dataset so that we can calculate probabilities and make predictions.
·        Classify samples from the test dataset using the summarized training dataset.
Objective:
·         To study the R programming language and the Naive Bayes technique for classification.
·         Requirements (Hw/Sw): PC, RStudio, Ubuntu system.
Theory:


  • Naive Bayes is a classification technique based on Bayes' theorem: it uses existing (labeled) data to estimate how probable each outcome is for a given set of conditions, and predicts the most probable one.
  • It is a supervised learning algorithm.
  • Classification is a classic data mining technique based on machine learning. It assigns each item in a data set to one of a predefined set of classes or groups. Classification methods draw on mathematical techniques such as decision trees, linear programming, neural networks and statistics.


Supervised learning
Supervised learning, as the name indicates, involves a supervisor acting as a teacher. We teach or train the machine using well-labeled data, that is, data already tagged with the correct answer. The supervised learning algorithm analyses this training data (the set of training examples); when the machine is then given a new set of examples, it uses what it has learned to predict their labels.
For instance, suppose you are given a basket filled with different kinds of fruit. The first step is to train the machine on each kind of fruit, one by one, like this:
  • If the object is rounded, has a depression at the top and is red, it is labeled Apple.
  • If the object is a long curving cylinder and is green-yellow, it is labeled Banana.
Now suppose that, after training, you are given a new fruit from the basket, say a banana, and asked to identify it.
The machine has already learned from the previous data and now has to apply that knowledge. It first examines the shape and color of the fruit, then confirms the fruit name as Banana and puts it in the Banana category. Thus the machine learns from the training data (the basket of fruit) and then applies that knowledge to the test data (the new fruit).

The Naive Bayes algorithm makes a very strong assumption: the features are independent of each other within each class, although in reality they may be dependent in some way. In other words, it assumes that the presence of one feature in a class is unrelated to the presence of any other feature.
If this independence assumption holds, Naive Bayes performs very well, often better than more complex models. Naive Bayes can also be used with continuous features but is better suited to categorical variables, so it is a natural choice when all the input features are categorical. For numeric features it makes another strong assumption: that each numeric variable is normally distributed within each class.
·        R has a package called 'e1071' that provides the Naive Bayes training function, naiveBayes().

  • Bayes' theorem is based on conditional probability.
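The independence assumption can be made concrete with a small base-R sketch. It computes a Naive Bayes posterior by hand from the built-in Titanic table: the prior P(Survived) is multiplied by P(feature value | Survived) for each feature, and the two scores are then normalised. The passenger profile and all variable names are illustrative assumptions, not part of the original experiment.

```r
# Hand-computed Naive Bayes posterior using only base R
data(Titanic)

# Hypothetical passenger profile, chosen only for illustration
profile <- c(Class = "1st", Sex = "Female", Age = "Adult")

# Prior P(Survived): Survived is the 4th dimension of the table
prior <- prop.table(margin.table(Titanic, 4))

# P(feature value | Survived) for each feature, one column per feature
likelihood <- sapply(names(profile), function(f) {
  dim_index <- which(names(dimnames(Titanic)) == f)
  counts <- margin.table(Titanic, c(4, dim_index))   # Survived x feature
  prop.table(counts, 1)[, profile[[f]]]
})

# Naive Bayes score: prior times the product of the per-feature likelihoods
score <- as.numeric(prior) * apply(likelihood, 1, prod)
posterior <- score / sum(score)
print(round(posterior, 3))
```

Multiplying the per-feature probabilities is exactly where the independence assumption enters: no joint table of Class, Sex and Age is ever consulted.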



Conditional probability
In probability theory, conditional probability is a measure of the probability of an event (some particular situation occurring) given that (by assumption, presumption, assertion or evidence) another event has occurred. If the event of interest is B and the event A is known or assumed to have occurred, "the conditional probability of B given A", or "the probability of B under the condition A", is usually written as P(B|A), or sometimes P(B/A). When P(A) > 0, it equals P(A and B) / P(A).
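As a small illustration of the definition, the sketch below estimates P(Survived = Yes | Sex = Female) from the built-in Titanic table in two equivalent ways: as P(A and B) / P(A), and directly from the counts. The variable names are illustrative assumptions.

```r
# Conditional probability from counts, using only base R
data(Titanic)

# Collapse the 4-way table to a Sex x Survived table of passenger counts
sex_by_survival <- margin.table(Titanic, c(2, 4))
n_total <- sum(sex_by_survival)

# A = passenger is female, B = passenger survived
p_a       <- sum(sex_by_survival["Female", ]) / n_total
p_a_and_b <- sex_by_survival["Female", "Yes"] / n_total

# P(B | A) = P(A and B) / P(A)
p_b_given_a <- p_a_and_b / p_a

# The same value read directly from the counts in the "Female" row
p_b_given_a_direct <- sex_by_survival["Female", "Yes"] /
  sum(sex_by_survival["Female", ])

print(round(c(p_b_given_a, p_b_given_a_direct), 4))
```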

A-priori probability
A-priori probability is calculated by logically examining a circumstance or existing information about a situation. It usually deals with independent events, where the likelihood of a given event occurring is in no way influenced by previous events. A coin toss is an example. The largest drawback of defining probabilities this way is that it applies only to a finite set of events, because most events are subject to conditional probability to at least a small degree. In the naiveBayes() output, the a-priori probabilities are simply the class proportions in the training data.

Confusion Matrix

In machine learning, and specifically in statistical classification, a confusion matrix (also known as an error matrix) is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the performance of an algorithm to be visualized and makes it easy to identify confusion between classes, e.g. one class being commonly mislabeled as the other. Most performance measures are computed from the confusion matrix.
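A minimal sketch of how such a table is built in R: table() cross-tabulates predicted labels against true labels. The two label vectors below are made up purely for illustration.

```r
# Hypothetical true and predicted labels for eight samples
actual    <- factor(c("Yes", "No", "No",  "Yes", "No", "Yes", "No", "No"),
                    levels = c("No", "Yes"))
predicted <- factor(c("Yes", "No", "Yes", "No",  "No", "Yes", "No", "No"),
                    levels = c("No", "Yes"))

# Rows are predictions, columns are the true classes
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Correct predictions sit on the diagonal; everything else is a confusion
misclassified <- sum(cm) - sum(diag(cm))
print(misclassified)
```

Naming the two arguments (Predicted, Actual) labels the axes in the printed table, which helps avoid mixing up rows and columns when reading it.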

How to Install
·        Command: Ctrl+L clears the console.
·        Packages to install: e1071, ggplot2 (Program-II also loads caTools).

Input:


Program-I (Applying Naive Bayes to the Titanic Dataset)

# Getting started with Naive Bayes
# Install the package (needed only once)
# install.packages("e1071")
# Load the library
library(e1071)
# ?naiveBayes  # The documentation also contains an example using the Titanic dataset
# Next, load the Titanic dataset
data(Titanic)

# View the dataset
View(Titanic)
# Show the first part of the dataset
head(Titanic)
# Show the last part of the dataset
tail(Titanic)

# Save it into a data frame
# A data frame is a table (a two-dimensional array-like structure) in which each column
# holds the values of one variable and each row holds one set of values, one per column.
Titanic_df=as.data.frame(Titanic)

# Build a vector of row indices in which each row number is repeated Freq times,
# so that every Class/Sex/Age/Survived combination appears once per passenger.
# rep.int() repeats the elements of a vector; it is not a repeat loop.
repeating_sequence=rep.int(seq_len(nrow(Titanic_df)), Titanic_df$Freq)

# Create one row per passenger from the frequency table
Titanic_dataset=Titanic_df[repeating_sequence,]

# The frequency column is no longer needed, so drop it
Titanic_dataset$Freq=NULL

# Fit the Naive Bayes model
Naive_Bayes_Model=naiveBayes(Survived ~., data=Titanic_dataset)
# What does the model say? Print the model summary
Naive_Bayes_Model

# Predict on the same dataset the model was trained on
NB_Predictions=predict(Naive_Bayes_Model,Titanic_dataset)

# Confusion matrix to check accuracy (rows: predicted, columns: actual)
table(NB_Predictions,Titanic_dataset$Survived)
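Program-I predicts on the same rows it was trained on, so its confusion matrix describes training performance only. The following is a hedged sketch of a held-out evaluation for the same data, assuming a 70/30 random split; the seed, the split ratio and the names train_set, test_set and holdout_* are illustrative choices, not part of the original program.

```r
library(e1071)
data(Titanic)

# Expand the frequency table to one row per passenger, as in Program-I
Titanic_df <- as.data.frame(Titanic)
Titanic_dataset <- Titanic_df[rep.int(seq_len(nrow(Titanic_df)), Titanic_df$Freq), ]
Titanic_dataset$Freq <- NULL

# Random 70/30 split (arbitrary seed, only for reproducibility)
set.seed(42)
train_idx <- sample(nrow(Titanic_dataset), size = floor(0.7 * nrow(Titanic_dataset)))
train_set <- Titanic_dataset[train_idx, ]
test_set  <- Titanic_dataset[-train_idx, ]

# Fit on the training rows only, then predict the unseen test rows
holdout_model <- naiveBayes(Survived ~ ., data = train_set)
holdout_pred  <- predict(holdout_model, test_set)

# Confusion matrix and accuracy on the held-out rows
holdout_cm <- table(Predicted = holdout_pred, Actual = test_set$Survived)
holdout_accuracy <- sum(diag(holdout_cm)) / sum(holdout_cm)
print(holdout_cm)
print(holdout_accuracy)
```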

Output of Program-I:
sndcoe@sndcoe-ThinkCentre-M72e:~$ Rscript titanic.R
[1]  0  0 35  0  0  0
[1]  75 192 140  80  76  20

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      No      Yes 
0.676965 0.323035 

Conditional probabilities:
     Class
Y            1st        2nd        3rd       Crew
  No  0.08187919 0.11208054 0.35436242 0.45167785
  Yes 0.28551336 0.16596343 0.25035162 0.29817159

     Sex
Y           Male     Female
  No  0.91543624 0.08456376
  Yes 0.51617440 0.48382560

     Age
Y          Child      Adult
  No  0.03489933 0.96510067
  Yes 0.08016878 0.91983122

              
                  Actual
NB_Predictions     No    Yes    Predicted total
           No    1364    362    1726
           Yes    126    349     475
Actual total     1490    711    2201
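The usual performance measures can be read off this confusion matrix. The sketch below re-enters the four counts reported above (rows are predictions, columns are actual classes) and derives accuracy, sensitivity, specificity and precision for the class "Yes"; the variable names are illustrative.

```r
# Counts copied from the Program-I confusion matrix (filled column by column)
cm <- matrix(c(1364, 126, 362, 349), nrow = 2,
             dimnames = list(Predicted = c("No", "Yes"),
                             Actual    = c("No", "Yes")))

accuracy    <- sum(diag(cm)) / sum(cm)              # all correct / all passengers
sensitivity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # survivors correctly found
specificity <- cm["No", "No"]   / sum(cm[, "No"])   # non-survivors correctly found
precision   <- cm["Yes", "Yes"] / sum(cm["Yes", ])  # predicted survivors that are right

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 3))
```

The model is right for roughly 78% of passengers, but it identifies only about half of the actual survivors, which the single accuracy figure hides.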
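Explanation: the 24 non-zero cell frequencies of the Titanic table add up to all 2201 passengers:
35+17+118+154+387+670+4+13+89+3+5+11+13+1+13+14+57+14+75+192+140+80+76+20 = 2201

A-priori probabilities:

P(Yes) = (5+11+13+1+13+14+57+14+75+192+140+80+76+20)/2201 = 711/2201 = 0.323
P(No)  = (35+17+118+154+387+670+4+13+89+3)/2201 = 1490/2201 = 0.677

Conditional probabilities, Class = 2nd:

P(2nd | No)  = (154+13)/1490 = 167/1490 = 0.112
P(2nd | Yes) = (11+13+14+80)/711 = 118/711 = 0.166

Conditional probabilities, Sex = Male:

P(Male | No)  = (35+118+154+387+670)/1490 = 1364/1490 = 0.915
P(Male | Yes) = (5+11+13+57+14+75+192)/711 = 367/711 = 0.516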
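The hand calculations above can be cross-checked in base R without fitting a model: margin.table() sums the Titanic table over the unwanted factors and prop.table() turns counts into proportions. This is only a verification sketch; the variable names are illustrative.

```r
data(Titanic)

# A-priori probabilities: proportion of each Survived class (4th dimension)
prior <- prop.table(margin.table(Titanic, 4))

# Conditional probabilities: rows are Survived classes, each row sums to 1
class_given_survived <- prop.table(margin.table(Titanic, c(4, 1)), 1)
sex_given_survived   <- prop.table(margin.table(Titanic, c(4, 2)), 1)

print(round(prior, 3))
print(round(class_given_survived, 3))
print(round(sex_given_survived, 3))
```

The printed values should agree with the "A-priori probabilities" and "Conditional probabilities" tables in the naiveBayes output for Program-I.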



DESCRIPTION
Predict the onset of diabetes based on diagnostic measures.
SUMMARY


This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes.

Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg/(height in m)^2)
  • DiabetesPedigreeFunction: Diabetes pedigree function
  • Age: Age (years)
  • Outcome: Class variable (0 or 1)

Inspiration:

  • Some values are outside the range where they are supposed to be and should be treated as missing values.
  • What kind of method is better for filling in this type of missing value? How will further classification be affected?
  • Are there any sub-groups significantly more likely to have diabetes?
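One possible way to handle the out-of-range values, sketched under stated assumptions: a zero in Glucose, BloodPressure, SkinThickness, Insulin or BMI is treated as a missing measurement (an assumption about the data, since such values are not physically plausible), recoded as NA, and then filled with the column median. The six rows are copied from the first training rows printed in the Program-II session; median imputation is just one simple option among many.

```r
# Six rows copied from the head(train) output of Program-II
pima_sample <- data.frame(
  Glucose       = c(148, 85, 183, 89, 137, 115),
  BloodPressure = c(72, 66, 64, 66, 40, 0),
  SkinThickness = c(35, 29, 0, 23, 35, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 35.3)
)

# Assumption: in these columns a zero means "not measured"
zero_means_missing <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Step 1: recode zeros as NA
cleaned <- pima_sample
for (col in zero_means_missing) {
  cleaned[[col]][cleaned[[col]] == 0] <- NA
}
print(colSums(is.na(cleaned)))

# Step 2: fill each NA with the median of the observed values in that column
imputed <- cleaned
for (col in zero_means_missing) {
  col_median <- median(imputed[[col]], na.rm = TRUE)
  imputed[[col]][is.na(imputed[[col]])] <- col_median
}
print(imputed)
```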

Program-II (Applying Naive Bayes to the Pima Indians Diabetes dataset)
mydata<-read.csv(file="/home/vnd/Desktop/dibetes/diabetes.csv",header=TRUE,sep=",")

# View() is spelled with a capital V
View(mydata)
library(caTools)
library(e1071)

# Part 1: split the data
# SplitRatio is spelled with a capital S and a capital R.
# sample.split() expects the vector of class labels, not the whole data frame:
# passing mydata (as in the recorded session further down) yields one flag per
# column, which subset() then recycles down the rows.
temp_field <- sample.split(mydata$Outcome,SplitRatio=0.7)

# About 70% of the rows go to training
train <- subset(mydata,temp_field==TRUE)

# The remaining 30% go to testing
test <- subset(mydata, temp_field==FALSE)

# Display a few samples
head(train)
head(test)

# Part 2: invoke the classifier (naiveBayes() comes from e1071, loaded above)
# Note that the class cannot be numeric; naiveBayes() needs it to be categorical,
# so as.factor() maps the 0/1 Outcome values to factor levels.
# This builds a Naive Bayes model in which the class to be predicted is Outcome
# and the data is the training set. No other parameters are set for now.
# S3 method for class 'formula': "~ ." models the class against all other columns.
my_model<- naiveBayes(as.factor(train$Outcome)~.,train)

# Show a summary of the calculated probabilities
my_model

# Part 3: predict
# Column 9 (Outcome) is dropped from the test data.
# Try adding type="class" or type="raw" after the test data.
pred1<-predict(my_model,test[,-9])

# Generate the confusion matrix (rows: predicted, columns: actual)
table(pred1,test$Outcome,dnn=c("predicted", "Actual"))

# Save the predictions next to the test data
output<-cbind(test,pred1)
View(output)
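sample.split() in caTools is designed to work on the vector of class labels and to keep the class ratio the same in both subsets. The same idea can be sketched in base R, which makes the mechanics visible. The toy data frame, the seed and all names below are hypothetical stand-ins, not the diabetes data.

```r
set.seed(123)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for the diabetes data: 130 negatives, 70 positives
toy <- data.frame(id = 1:200, Outcome = rep(c(0, 1), times = c(130, 70)))

# Within each class, draw 70% of the row indices for training
rows_by_class <- split(seq_len(nrow(toy)), toy$Outcome)
train_rows <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
}))

strat_train <- toy[train_rows, ]
strat_test  <- toy[-train_rows, ]

# Both subsets keep the 65/35 class ratio of the full data
print(table(strat_train$Outcome))
print(table(strat_test$Outcome))
```

Sampling inside each class, rather than over all rows at once, is what keeps the proportion of positive cases stable between the training and test sets.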

Output of Program-II:
> mydata<-read.csv(file="/home/vnd/Desktop/dibetes/diabetes.csv",header=TRUE,sep=",")
> 
> # V of View should be capital
> View(mydata)
> library(caTools)
> library(e1071)
> 
> #part1 
> #S & R should be capital in SplitRatio
> temp_field <- sample.split(mydata,SplitRatio=0.7)
> 
> #70% will be in training
> train <- subset(mydata,temp_field==TRUE)
> 
> #30% will be in testing
> test <- subset(mydata, temp_field==FALSE)
> 
> #display few samples
> head(train)
  Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI DiabetesPedigreeFunction Age Outcome
1           6     148            72            35       0 33.6                    0.627  50       1
2           1      85            66            29       0 26.6                    0.351  31       0
3           8     183            64             0       0 23.3                    0.672  32       1
4           1      89            66            23      94 28.1                    0.167  21       0
5           0     137            40            35     168 43.1                    2.288  33       1
8          10     115             0             0       0 35.3                    0.134  29       0
> head(test)
   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI DiabetesPedigreeFunction Age Outcome
6            5     116            74             0       0 25.6                    0.201  30       0
7            3      78            50            32      88 31.0                    0.248  26       1
9            2     197            70            45     543 30.5                    0.158  53       1
15           5     166            72            19     175 25.8                    0.587  51       1
16           7     100             0             0       0 30.0                    0.484  32       1
18           7     107            74             0       0 29.6                    0.254  31       1
> # install Naive Bayes package i.e. e1071 and add it to the top of program library(e1071)
> #part2 invoke classifier
> 
> # make a note, the class cannot be numeric, it needs to be catogarical for naive bayes 
> #as specified in the function, hence as.factor internally maps the 1 and 0 to catogarical value
> #this will generate a model for naive bayes where the ouput class to be predicated is outcome and the data is mydata
> #no other parameters are considered here as of now
> #s3 methos of formula ~ is against all
> my_model<- naiveBayes(as.factor(train$Outcome)~.,train)
> 
> #To see summery of the probabelities calculated-
> my_model

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
        0         1 
0.6491228 0.3508772 

Conditional probabilities:
   Pregnancies
Y       [,1]     [,2]
  0 3.228228 3.060954
  1 4.572222 3.576819

   Glucose
Y       [,1]     [,2]
  0 109.5556 27.19288
  1 141.7944 33.17394

   BloodPressure
Y       [,1]     [,2]
  0 67.88589 19.09548
  1 70.84444 20.62591

   SkinThickness
Y       [,1]     [,2]
  0 19.71171 14.81374
  1 20.70556 17.89938

   Insulin
Y       [,1]     [,2]
  0 67.95495 101.3626
  1 94.29444 138.1624

   BMI
Y       [,1]     [,2]
  0 30.59399 7.968936
  1 34.77778 7.175288

   DiabetesPedigreeFunction
Y        [,1]      [,2]
  0 0.4385135 0.3062517
  1 0.5595778 0.4073832

   Age
Y       [,1]     [,2]
  0 30.81081 11.47536
  1 36.93889 11.24445

> 
> 
> 
> #part3
> #predicting, try putting type="class" or type="raw" after the test data
> pred1<-predict(my_model,test[,-9])
> 
> 
> #generate the confussion matrix..
> table(pred1,test$Outcome,dnn=c("predicted", "Actual"))
         Actual
predicted   0   1
        0 144  37
        1  23  51
> 
> #to save prediction
> output<-cbind(test,pred1)
> View(output)
>
Conclusion:

