Study the R Programming Platform: Download the Pima Indians Diabetes Dataset or the Titanic Dataset and Use the Naive Bayes Algorithm for Classification
Experiment No: DA-(3,4)
TITLE:
- Download the Pima Indians Diabetes dataset. Use the Naive Bayes algorithm for classification.
The Titanic dataset in R is a table of about 2200 passengers summarized according to four factors: economic status (1st class, 2nd class, 3rd class or crew), gender (male or female), age category (child or adult) and whether the passenger survived.
· Load the data from a CSV file and split it into training and test datasets.
· Summarize the properties of the training dataset so that we can calculate probabilities and make predictions.
· Classify samples from a test dataset using the summarized training dataset.
Objective:
· To study the R programming language and the Naive Bayes technique for classification.
Requirements (Hw/Sw):
PC, RStudio, Ubuntu system.
Theory:-
- Naive Bayes is a classification technique built on Bayes' theorem: it uses existing data to predict the outcome of an event for a given set of conditions.
- It is a supervised learning algorithm.
- CLASSIFICATION is a classic data mining technique based on machine learning. It assigns each item in a set of data to one of a predefined set of classes or groups. Classification methods make use of mathematical techniques such as decision trees, linear programming, neural networks and statistics.
Supervised learning
Supervised learning, as the name indicates, involves a supervisor acting as a teacher. We train the machine using data that is well labeled, meaning that each example is already tagged with the correct answer. The machine is then given a new set of examples, and the supervised learning algorithm uses what it learned from the labeled training examples to predict the outcome for them.
For instance, suppose you are given a basket filled with different kinds of fruit. The first step is to train the machine on the different fruits one by one, like this:
- If the object is rounded, has a depression at the top and is red, it is labelled Apple.
- If the object is a long curving cylinder and is green-yellow, it is labelled Banana.
Now suppose that, after training, you are given a new fruit from the basket, say a banana, and asked to identify it.
The machine has already learned from the previous data and now has to apply it. It first examines the shape and color of the fruit, then confirms the fruit name as BANANA and puts it in the Banana category. Thus the machine learns from the training data (the basket of fruits) and then applies that knowledge to the test data (the new fruit).
Naive Bayes makes a very strong assumption: within each class, the features are independent of each other, whereas in reality they may be dependent in some way. In other words, it assumes that the presence of one feature in a class is completely unrelated to the presence of any other feature.
When this independence assumption holds, Naive Bayes performs very well and can outperform more complex models. It can also be used with continuous features but is better suited to categorical variables, and it is a natural choice when all input features are categorical. For numeric features it makes another strong assumption: that each numeric variable is normally distributed within each class.
· The R package 'e1071' provides the naiveBayes training function.
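The independence assumption can be made concrete with a small base R sketch that scores one passenger by hand. This is only an illustration, not the e1071 implementation: the helper name posterior and the chosen passenger (1st class, female, adult) are assumptions made for the example, and the data is R's built-in Titanic table.

```r
# Expand the built-in Titanic contingency table to one row per passenger.
titanic_df <- as.data.frame(Titanic)
passengers <- titanic_df[rep(seq_len(nrow(titanic_df)), titanic_df$Freq),
                         c("Class", "Sex", "Age", "Survived")]

# Class priors P(Survived).
prior <- prop.table(table(passengers$Survived))

# One table per feature holding P(feature value | Survived); rows are classes.
cond <- sapply(c("Class", "Sex", "Age"), function(f) {
  prop.table(table(passengers$Survived, passengers[[f]]), margin = 1)
}, simplify = FALSE)

# Hypothetical helper: multiply the prior by the per-feature conditionals
# (this product is exactly the independence assumption), then normalize.
posterior <- function(class, sex, age) {
  score <- prior * cond$Class[, class] * cond$Sex[, sex] * cond$Age[, age]
  score / sum(score)
}

post <- posterior("1st", "Female", "Adult")
print(round(post, 3))
```

The product of per-feature terms is what makes the method "naive": no joint table over Class, Sex and Age is ever consulted.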
- Bayes' theorem is based on conditional probability.
Conditional probability
In probability theory, conditional probability is a measure of the probability of an event (some particular situation occurring) given that (by assumption, presumption, assertion or evidence) another event has occurred. If the event of interest is B and the event A is known or assumed to have occurred, "the conditional probability of B given A", or "the probability of B under the condition A", is usually written as P(B|A) or P(B/A).
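Written out, the definition above and the way Naive Bayes uses it look roughly as follows. The symbols C (the class) and x_1, ..., x_n (the feature values) are notation introduced here for illustration; the last line is the standard Naive Bayes factorization under the independence assumption.

```latex
% Conditional probability of B given A
P(B \mid A) = \frac{P(A \cap B)}{P(A)}, \qquad P(A) > 0

% Bayes' theorem
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0

% Naive Bayes: posterior of class C given features x_1, ..., x_n
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The predicted class is the one with the largest value on the right-hand side of the last line.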
A-priori probability
An a priori probability is calculated by logically examining a circumstance or existing information about a situation. It usually deals with independent events, where the likelihood of a given event occurring is in no way influenced by previous events. An example is a coin toss. The largest drawback of this way of defining probabilities is that it can only be applied to a finite set of events, since most events are subject to conditional probability to at least a small degree. In the naiveBayes output, the a-priori probabilities are the class proportions in the training data.
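As a quick check, the class priors for the Titanic data can be computed directly from R's built-in Titanic table with base R. This is a minimal sketch; the variable names are chosen for the example.

```r
# Titanic is a 4-dimensional table: Class x Sex x Age x Survived.
# Summing over the first three dimensions leaves the count per Survived level.
survived_counts <- margin.table(Titanic, 4)

# Dividing by the total number of passengers gives the a-priori probabilities.
priors <- prop.table(survived_counts)

print(survived_counts)
print(round(priors, 6))
```

These proportions should agree with the "A-priori probabilities" block that naiveBayes prints for the same data.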
Confusion Matrix
In machine learning, and specifically in statistical classification, a confusion matrix (also known as an error matrix) is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.
It allows the performance of an algorithm to be visualized.
It allows easy identification of confusion between classes, e.g. one class being commonly mislabeled as the other. Most performance measures are computed from the confusion matrix.
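A short base R sketch of how common measures are derived from a confusion matrix. The four counts are the ones reported for the Titanic model in the output section of this write-up; treating "Yes" (survived) as the positive class is an assumption made for the example.

```r
# Rows are predicted classes, columns are actual classes.
cm <- matrix(c(1364, 126, 362, 349), nrow = 2,
             dimnames = list(Predicted = c("No", "Yes"),
                             Actual    = c("No", "Yes")))

# Accuracy: share of passengers on the diagonal (predicted == actual).
accuracy <- sum(diag(cm)) / sum(cm)

# Precision for "Yes": of those predicted to survive, how many did.
precision_yes <- cm["Yes", "Yes"] / sum(cm["Yes", ])

# Recall for "Yes": of those who survived, how many were predicted to.
recall_yes <- cm["Yes", "Yes"] / sum(cm[, "Yes"])

print(cm)
print(round(c(accuracy = accuracy,
              precision_yes = precision_yes,
              recall_yes = recall_yes), 3))
```

Accuracy alone hides the imbalance between the two kinds of error, which is why per-class measures are worth reading next to it.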
How to Install
· Command: Ctrl+L clears the console.
· Packages to install: e1071, ggplot2
Input:
Program-I (Applying Naive Bayes to the Titanic dataset)
# Getting started with Naive Bayes
# Install the package once (uncomment if it is not installed yet)
# install.packages("e1071")
# Load the library
library(e1071)
# ?naiveBayes  # the documentation also contains an example using the Titanic dataset
# Load the built-in Titanic dataset (a 4-dimensional contingency table)
data(Titanic)
# Open the dataset in the viewer (interactive sessions only)
if (interactive()) View(Titanic)
# Show the first and last entries of the table
head(Titanic)
tail(Titanic)
# Convert the table to a data frame: one row per combination of
# Class, Sex, Age and Survived, with its count in the Freq column
Titanic_df <- as.data.frame(Titanic)
# Build a row index that repeats each combination Freq times,
# so that every passenger gets a row of their own
repeating_sequence <- rep.int(seq_len(nrow(Titanic_df)), Titanic_df$Freq)
# Create the per-passenger dataset from the table
Titanic_dataset <- Titanic_df[repeating_sequence, ]
# The frequency is no longer needed, so drop the column
Titanic_dataset$Freq <- NULL
# Fit the Naive Bayes model with Survived as the class
Naive_Bayes_Model <- naiveBayes(Survived ~ ., data = Titanic_dataset)
# Print the model: a-priori and conditional probabilities
Naive_Bayes_Model
# Predict on the same data the model was trained on
NB_Predictions <- predict(Naive_Bayes_Model, Titanic_dataset)
# Confusion matrix (rows: predicted, columns: actual) to check accuracy
table(NB_Predictions, Titanic_dataset$Survived)
Output of Program-I:
sndcoe@sndcoe-ThinkCentre-M72e:~$ Rscript titanic.R
[1] 0 0 35 0 0 0
[1] 75 192 140 80 76 20
Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      No      Yes
0.676965 0.323035

Conditional probabilities:
     Class
Y            1st        2nd        3rd       Crew
  No  0.08187919 0.11208054 0.35436242 0.45167785
  Yes 0.28551336 0.16596343 0.25035162 0.29817159

     Sex
Y           Male     Female
  No  0.91543624 0.08456376
  Yes 0.51617440 0.48382560

     Age
Y          Child      Adult
  No  0.03489933 0.96510067
  Yes 0.08016878 0.91983122

NB_Predictions   No  Yes
           No  1364  362    (predicted No: 1726)
           Yes  126  349    (predicted Yes: 475)
-----------------------------
Actual No: 1490    Actual Yes: 711
Explanation
The cell frequencies of the Titanic table add up to the 2201 passengers:
35+17+118+154+387+670+4+13+89+3+5+11+13+1+13+14+57+14+75+192+140+80+76+20 = 2201
// A-priori probabilities:
Yes: (5+11+13+1+13+14+57+14+75+192+140+80+76+20)/2201 = 711/2201 = 0.323
No: (35+17+118+154+387+670+4+13+89+3)/2201 = 1490/2201 = 0.677
// Conditional probabilities: Class = 2nd
No: (154+13)/1490 = 167/1490 = 0.112
Yes: (11+13+14+80)/711 = 118/711 = 0.166
// Conditional probabilities: Sex = Male
No: (35+118+154+387+670)/1490 = 1364/1490 = 0.915
Yes: (5+11+13+57+14+75+192)/711 = 367/711 = 0.516
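The hand calculations above can be cross-checked in base R without fitting a model. This is a sketch using the built-in Titanic table; the variable names are chosen for the example.

```r
# Two-way counts: Survived (dimension 4) against each predictor,
# then divide every row by its total to get P(value | Survived).
class_given_survived <- prop.table(margin.table(Titanic, c(4, 1)), margin = 1)
sex_given_survived   <- prop.table(margin.table(Titanic, c(4, 2)), margin = 1)
age_given_survived   <- prop.table(margin.table(Titanic, c(4, 3)), margin = 1)

print(round(class_given_survived, 4))
print(round(sex_given_survived, 4))
print(round(age_given_survived, 4))
```

Each row sums to 1 because it is a probability distribution over the values of one feature within one class.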
Program-II (Applying Naive Bayes to the Pima Indians Diabetes dataset)
# Read the dataset (adjust the path to where diabetes.csv is stored)
mydata <- read.csv(file = "/home/vnd/Desktop/dibetes/diabetes.csv", header = TRUE, sep = ",")
# Note: the V of View must be capital
View(mydata)
# Install caTools and e1071 once with install.packages() before loading them
library(caTools)
library(e1071)
# part 1: split the data
# Note: S and R must be capital in SplitRatio
# sample.split expects the vector of class labels, not the whole data frame
temp_field <- sample.split(mydata$Outcome, SplitRatio = 0.7)
# 70% of the rows go to training
train <- subset(mydata, temp_field == TRUE)
# 30% of the rows go to testing
test <- subset(mydata, temp_field == FALSE)
# display a few samples
head(train)
head(test)
# part 2: invoke the classifier
# The class cannot be numeric; it must be categorical for Naive Bayes,
# so as.factor maps the 0/1 values of Outcome to factor levels.
# This builds a model that predicts Outcome from all other columns of train
# ("~ ." means "against all remaining columns"); no other parameters are set.
my_model <- naiveBayes(as.factor(train$Outcome) ~ ., train)
# Print the a-priori probabilities and the per-class statistics
my_model
# part 3: predict
# Column 9 (Outcome) is dropped from the test data; try adding type = "class" or type = "raw"
pred1 <- predict(my_model, test[, -9])
# generate the confusion matrix
table(pred1, test$Outcome, dnn = c("predicted", "Actual"))
# save the predictions alongside the test data
output <- cbind(test, pred1)
View(output)
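A label-based split is meant to keep the proportion of each class roughly the same in the training and test sets. The base R sketch below illustrates that idea only; it is not the caTools implementation. The helper stratified_split, the seed and the toy labels are assumptions made for the example.

```r
set.seed(42)  # arbitrary seed, used only to make the example repeatable

# Hypothetical helper: mark about `ratio` of the rows of every class as training.
stratified_split <- function(labels, ratio = 0.7) {
  in_train <- logical(length(labels))
  for (cls in unique(labels)) {
    idx <- which(labels == cls)                 # rows belonging to this class
    n_train <- round(ratio * length(idx))       # how many of them to train on
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

# Toy labels with the same 0/1 coding as Outcome (made up, not the real data).
outcome <- rep(c(0, 1), times = c(70, 30))
in_train <- stratified_split(outcome)

# Cross-tabulate class against training membership.
print(table(outcome, in_train))
```

The result is one logical value per row, which is what subset() needs to select rows without recycling a shorter vector.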
DESCRIPTION
Predict the onset of diabetes based on diagnostic measures.
SUMMARY
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes.
Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1)
Inspiration:
- Some values are outside the range they are supposed to be in and should be treated as missing values.
- Which method is best for filling in this kind of missing value? How will it affect further classification?
- Are there any sub-groups significantly more likely to have diabetes?
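One way to act on the first two points is sketched below in base R. The rows are the first training rows printed in this write-up (five of the nine columns). Which columns cannot legitimately be zero, and the use of the column median as a fill value, are assumptions made for the example rather than a recommendation from the dataset's authors.

```r
# Rows 1, 2, 3, 4, 5 and 8 of the dataset, as printed by head(train).
pima <- data.frame(
  Glucose       = c(148, 85, 183, 89, 137, 115),
  BloodPressure = c(72, 66, 64, 66, 40, 0),
  SkinThickness = c(35, 29, 0, 23, 35, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 35.3)
)

# Assumption: a zero in these measurements means "not recorded".
zero_means_missing <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Step 1: recode zeros as NA so they stop distorting means and variances.
pima_clean <- pima
pima_clean[zero_means_missing] <- lapply(pima_clean[zero_means_missing],
                                         function(x) replace(x, x == 0, NA))
print(colSums(is.na(pima_clean)))

# Step 2 (one simple option): fill each NA with the median of its column.
pima_imputed <- as.data.frame(lapply(pima_clean, function(x) {
  replace(x, is.na(x), median(x, na.rm = TRUE))
}))
print(pima_imputed)
```

Median filling is only the simplest choice; comparing the confusion matrix before and after such a step is a direct way to see its effect on classification.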
Output of Program-II (one run; the split is random, so the rows and counts change from run to run):
> head(train)
  Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI DiabetesPedigreeFunction Age Outcome
1           6     148            72            35       0 33.6                    0.627  50       1
2           1      85            66            29       0 26.6                    0.351  31       0
3           8     183            64             0       0 23.3                    0.672  32       1
4           1      89            66            23      94 28.1                    0.167  21       0
5           0     137            40            35     168 43.1                    2.288  33       1
8          10     115             0             0       0 35.3                    0.134  29       0
> head(test)
   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI DiabetesPedigreeFunction Age Outcome
6            5     116            74             0       0 25.6                    0.201  30       0
7            3      78            50            32      88 31.0                    0.248  26       1
9            2     197            70            45     543 30.5                    0.158  53       1
15           5     166            72            19     175 25.8                    0.587  51       1
16           7     100             0             0       0 30.0                    0.484  32       1
18           7     107            74             0       0 29.6                    0.254  31       1
> my_model

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
        0         1
0.6491228 0.3508772

Conditional probabilities:
   Pregnancies
Y       [,1]     [,2]
  0 3.228228 3.060954
  1 4.572222 3.576819

   Glucose
Y       [,1]     [,2]
  0 109.5556 27.19288
  1 141.7944 33.17394

   BloodPressure
Y       [,1]     [,2]
  0 67.88589 19.09548
  1 70.84444 20.62591

   SkinThickness
Y       [,1]     [,2]
  0 19.71171 14.81374
  1 20.70556 17.89938

   Insulin
Y       [,1]     [,2]
  0 67.95495 101.3626
  1 94.29444 138.1624

   BMI
Y       [,1]     [,2]
  0 30.59399 7.968936
  1 34.77778 7.175288

   DiabetesPedigreeFunction
Y        [,1]      [,2]
  0 0.4385135 0.3062517
  1 0.5595778 0.4073832

   Age
Y       [,1]     [,2]
  0 30.81081 11.47536
  1 36.93889 11.24445

> table(pred1, test$Outcome, dnn = c("predicted", "Actual"))
         Actual
predicted   0   1
        0 144  37
        1  23  51
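For numeric predictors, the two columns that naiveBayes prints per class are the mean ([,1]) and the standard deviation ([,2]). The base R sketch below uses the values from the output above to score one test row by hand under the normal-distribution assumption. It approximates what the classifier does and is not the e1071 code; the choice of test row 9 is arbitrary.

```r
features <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
              "Insulin", "BMI", "DiabetesPedigreeFunction", "Age")

# A-priori probabilities, per-class means and standard deviations from the output.
prior <- c("0" = 0.6491228, "1" = 0.3508772)
means <- rbind("0" = c(3.228228, 109.5556, 67.88589, 19.71171,
                       67.95495, 30.59399, 0.4385135, 30.81081),
               "1" = c(4.572222, 141.7944, 70.84444, 20.70556,
                       94.29444, 34.77778, 0.5595778, 36.93889))
sds   <- rbind("0" = c(3.060954, 27.19288, 19.09548, 14.81374,
                       101.3626, 7.968936, 0.3062517, 11.47536),
               "1" = c(3.576819, 33.17394, 20.62591, 17.89938,
                       138.1624, 7.175288, 0.4073832, 11.24445))
colnames(means) <- colnames(sds) <- features

# Test row 9 from head(test); its recorded Outcome is 1.
x <- c(2, 197, 70, 45, 543, 30.5, 0.158, 53)

# Log prior plus the sum of per-feature Gaussian log densities for each class.
log_score <- sapply(rownames(means), function(cls) {
  log(prior[[cls]]) + sum(dnorm(x, mean = means[cls, ], sd = sds[cls, ], log = TRUE))
})

# Normalize in a numerically stable way to get posterior probabilities.
posterior <- exp(log_score - max(log_score))
posterior <- posterior / sum(posterior)
print(round(posterior, 3))
```

Working in logs avoids underflow when many small densities are multiplied together.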
Conclusion: