Study 'R' Programming Platform and Perform various operations on Iris Flower dataset.
Experiment No: DA-(1,2)
TITLE:
- Download the Iris flower dataset or any other dataset
· How many features are there and what are their types (e.g., numeric, nominal)?
· Compute and display summary statistics for each feature available in the dataset (e.g., minimum value, maximum value, mean, range, standard deviation, variance and percentiles).
· Data Visualization: Create a histogram for each feature in the dataset to illustrate the feature distributions. Plot each histogram.
· Create a boxplot for each feature in the dataset. All of the boxplots should be combined into a single plot. Compare distributions and identify outliers.
Objective:
· To study the R programming language and statistical computing
Requirements (Hw/Sw):
PC, RStudio, Ubuntu system, R.
Theory:-
R is a programming language and software environment for statistical analysis, graphics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems such as Linux, Windows and Mac. The language was named R partly after the first letter of the first names of its two authors (Robert Gentleman and Ross Ihaka), and partly as a play on the name of the Bell Labs language S.
Features of R
As stated earlier, R is a programming language and software environment for statistical analysis, graphics representation and reporting. The following are the important features of R:
· R is a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
· R has an effective data handling and storage facility.
· R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
· R provides a large, coherent and integrated collection of tools for data analysis.
· R provides graphical facilities for data analysis and display, either directly on the computer screen or printed on paper.
In conclusion, R is one of the world's most widely used statistical programming languages.
R Command Prompt
Once you have the R environment set up, it is easy to start the R command prompt by typing the following command at your command prompt:
$ R
This will launch the R interpreter and you will get a prompt > where you can start typing your program as follows:
> myString <- "Hello, World!"
> print(myString)
[1] "Hello, World!"
Here the first statement defines a string variable myString and assigns it the string "Hello, World!". The next statement uses print() to print the value stored in the variable myString.
R Script File
Usually, you will write your programs in script files and then execute those scripts at your command prompt with the help of the R interpreter called Rscript. So let's start by writing the following code in a text file called test.R:
# My first program in R Programming
myString <- "Hello, World!"
print(myString)
Save the above code in a file test.R and execute it at the Linux command prompt as given below. Even if you are using Windows or another system, the syntax will remain the same.
$ Rscript test.R
When we run the above program, it produces the following result.
[1] "Hello, World!"
Comments
Comments are like helping text in your R program and they are ignored by the interpreter while executing your actual program. A single-line comment is written using # at the beginning of the statement as follows:
# My first program in R Programming
R does not support multi-line comments, but you can use a trick such as the following:
if (FALSE) {
"This is a demo for multi-line comments
and it should be put inside either a
single OR double quote"
}
myString <- "Hello, World!"
print(myString)
[1] "Hello, World!"
Although the R interpreter still parses the block above, the condition is FALSE, so the block is never evaluated and it does not interfere with your actual program. You should put such comments inside either single or double quotes.
In contrast to other programming languages like C and Java, variables in R are not declared with a data type. Variables are assigned R-objects, and the data type of the R-object becomes the data type of the variable. There are many types of R-objects. The frequently used ones are:
· Vectors
· Lists
· Matrices
· Arrays
· Factors
· Data Frames
The simplest of these objects is the vector object, and there are six data types of these atomic vectors, also termed the six classes of vectors. The other R-objects are built upon the atomic vectors.
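The six atomic vector types can be sketched as follows. This is a minimal illustration; the variable names (examples, classes, apple) are made up for this example and do not come from the experiment.

```r
# One example value for each of the six atomic vector types.
examples <- list(
  logical   = TRUE,
  numeric   = 23.5,
  integer   = 2L,
  complex   = 2 + 5i,
  character = "Hello",
  raw       = charToRaw("Hello")
)

# class() reports the data type carried by each object.
classes <- sapply(examples, class)
print(classes)

# A vector with more than one element is built with c().
apple <- c("red", "green", "yellow")
print(class(apple))
print(length(apple))
```

Because a variable simply takes the type of the object assigned to it, assigning a different kind of value to the same name later changes what class() returns.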
Statistics is the science of collecting, exploring and presenting large amounts of data to discover underlying patterns and trends. Statistics are applied every day, in research, industry and government, to become more scientific about decisions that need to be made.
When census data cannot be collected, statisticians collect data by developing specific experiment designs and survey samples. Representative sampling assures that inferences and conclusions can reasonably extend from the sample to the population as a whole.
Two main statistical methods are used in data analysis: descriptive statistics, which summarize data from a sample using indexes such as the mean or standard deviation, and inferential statistics, which draw conclusions from data that are subject to random variation (e.g., observational errors, sampling variation).
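The difference between the two methods can be sketched in R. This is only an illustration: it treats the 150 sepal lengths in the built-in iris data as a sample, and the choice of a one-sample t-test confidence interval as the inferential step is an assumption made for this example.

```r
data("iris")
x <- iris$Sepal.Length

# Descriptive statistics: summarize the sample itself.
sample_mean <- mean(x)
sample_sd   <- sd(x)

# Inferential statistics: a 95% confidence interval for the
# population mean, taken from a one-sample t-test.
ci <- t.test(x, conf.level = 0.95)$conf.int

print(sample_mean)
print(sample_sd)
print(ci)
```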
Data Mining Vs. Statistics
Statistics is essentially a part of the process of data mining. It is the science of learning from data, and it provides tools and techniques for dealing with large amounts of data. Data mining and statistics are both part of the process of learning from data and analysing it. They both discover and analyse structures in data with the aim of transforming it into information. Although their aims are similar, they differ in approach. Data mining can be applied to both numeric and non-numeric data, whereas statistical methods are most commonly applied to numeric data.
How to load Libraries and Dataset from R Only
library(datasets)  # to select the library
data("iris")       # to load a particular dataset from the library
names(iris)        # to show column names
dim(iris)          # to show dimensions
View(iris)         # to view the dataset
str(iris)          # internal structure of the dataset
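The first question of the experiment (how many features there are and what their types are) can be answered directly from the loaded data frame. A minimal sketch; the variable names n_features and feature_types are made up for this example.

```r
data("iris")

# Number of features (columns) in the dataset.
n_features <- ncol(iris)
print(n_features)

# Type of each feature: "numeric" columns are measurements,
# a "factor" column is a nominal (categorical) feature.
feature_types <- sapply(iris, class)
print(feature_types)

# The categories of the nominal feature.
print(levels(iris$Species))
```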
How to load Dataset from HOST MACHINE Only
mydata <- read.csv(file="/home/vnd/Desktop/dibetes diabetes.csv", header=TRUE, sep=",")
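To try read.csv() without depending on a particular file on the host machine, the sketch below first writes a tiny CSV to a temporary location and then reads it back. The file contents and the column names roll_no and marks are invented for this illustration.

```r
# Write a small example file so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("roll_no,marks", "1,15", "2,17", "3,15"), csv_path)

# header = TRUE uses the first line as column names;
# sep = "," states that the fields are comma separated.
mydata <- read.csv(file = csv_path, header = TRUE, sep = ",")

str(mydata)   # structure of the resulting data frame
head(mydata)  # first rows
```

Replacing csv_path with the path of your own file gives the same call as in the line above.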
R-BOXPLOT
Boxplots show how the data in a data set is distributed. The three quartiles divide the data set into four parts. The graph represents the minimum, maximum, median, first quartile and third quartile in the data set. It is also useful for comparing the distribution of data across data sets by drawing boxplots for each of them.
Boxplots are created in R by using the boxplot() function.
Syntax
The basic syntax to create a boxplot in R is:
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used:
· x is a vector or a formula.
· data is the data frame.
· notch is a logical value. Set as TRUE to draw a notch.
· varwidth is a logical value. Set as TRUE to draw the width of the box proportionate to the sample size.
· names are the group labels which will be printed under each boxplot.
· main is used to give a title to the graph.
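The parameters described above can be tried on the iris data. This is a sketch rather than part of the original experiment: the formula Sepal.Length ~ Species and the title text are choices made for this example.

```r
data("iris")

# x is given as a formula: one box of Sepal.Length per Species.
bp <- boxplot(Sepal.Length ~ Species, data = iris,
              notch = TRUE,      # draw a notch around each median
              varwidth = TRUE,   # box width reflects the group size
              names = c("setosa", "versicolor", "virginica"),
              main = "Sepal length by species")

# boxplot() also returns the numbers it drew: each column of
# bp$stats holds lower whisker, Q1, median, Q3 and upper whisker.
print(bp$stats)
print(bp$n)
```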
Axes and Text
Many high-level plotting functions (plot, hist, boxplot, etc.) allow you to include axis and text options (as well as other graphical parameters).
For example:
# Specify axis options within plot()
plot(x, y, main="title", sub="subtitle", xlab="X-axis label", ylab="y-axis label",
  xlim=c(xmin, xmax), ylim=c(ymin, ymax))
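The template above uses placeholder names. A concrete version, assuming the iris data and axis limits picked by hand for this illustration, might look like this:

```r
data("iris")

# Axis limits chosen to enclose the data (values are assumptions).
x_limits <- c(4, 8)
y_limits <- c(0, 7)

plot(iris$Sepal.Length, iris$Petal.Length,
     main = "Iris measurements", sub = "150 flowers",
     xlab = "Sepal length (cm)", ylab = "Petal length (cm)",
     xlim = x_limits, ylim = y_limits)
```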
R - Histograms
A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is similar to a bar chart, but the difference is that it groups the values into continuous ranges. The height of each bar in a histogram represents the number of values present in that range.
R creates a histogram using the hist() function. This function takes a vector as input and uses some more parameters to plot histograms.
Syntax
The basic syntax for creating a histogram using R is:
hist(v, main, xlab, xlim, ylim, breaks, col, border)
Following is the description of the parameters used:
· v is a vector containing numeric values used in the histogram.
· main indicates the title of the chart.
· col is used to set the color of the bars.
· border is used to set the border color of each bar.
· xlab is used to give a description of the x-axis.
· xlim is used to specify the range of values on the x-axis.
· ylim is used to specify the range of values on the y-axis.
· breaks is used to set the break points between the bars, or to suggest the number of bars.
In a histogram, the total range of the data set (i.e., from the minimum value to the maximum value) is typically divided into 8 to 15 equal parts. These equal parts are known as bins or class intervals. Each observation (or value) in the data set is placed in the appropriate bin. The number of observations occupying a given bin becomes the frequency of that bin. Note that no overlap is allowed between the bins: any observation can occupy one and only one bin.
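The binning step can be reproduced by hand with cut() and table(). This sketch reuses the values from the example vector v used later in this post; the bin width of 10 is an assumption made for illustration (it gives only five bins, which is enough for eleven values).

```r
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# Bin edges: 0, 10, 20, 30, 40, 50.
edges <- seq(0, 50, by = 10)

# cut() assigns every observation to exactly one interval
# of the form (lower, upper].
bins <- cut(v, breaks = edges)

# table() counts the observations per bin: the frequencies.
freq <- table(bins)
print(freq)
```

The frequencies always add up to the number of observations, which is a quick check that no value was dropped or counted twice.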
The hist() function returns an object of class histogram, which has:
· breaks - places where the breaks occur,
· counts - the number of observations falling in that cell,
· density - the density of cells,
· mids - the midpoints of cells,
· xname - the x argument name and
· equidist - a logical value indicating if the breaks are equally spaced or not.
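These components can be inspected without drawing anything by passing plot = FALSE. A minimal sketch, using the same example values as the script further below:

```r
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# plot = FALSE computes the histogram object without plotting it.
h <- hist(v, plot = FALSE)

print(class(h))     # class of the returned object
print(h$breaks)     # bin edges chosen by R
print(h$counts)     # observations per bin
print(h$mids)       # midpoint of each bin
print(h$equidist)   # TRUE when all bins have the same width
```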
Example
A simple histogram is created using the input vector, label, col and border parameters.
The script given below will create and save the histogram in the current R working directory.
# Create data for the graph.
v <- c(9,13,21,8,36,22,12,41,31,33,19)
# Give the chart file a name.
png(file = "histogram.png")
# Create the histogram.
hist(v, xlab = "Weight", col = "yellow", border = "blue")
# Save the file.
dev.off()
When we execute the above code, it writes the histogram to the file histogram.png.
Outlier In Boxplot
An outlier is an observation that is numerically distant from the rest of the data. When reviewing a boxplot, an outlier is defined as a data point that is located outside the whiskers of the boxplot (e.g., more than 1.5 times the interquartile range above the upper quartile or below the lower quartile).
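The 1.5 times IQR rule can be applied by hand. The sketch below uses the 15 marks listed in the Dataset section of this post, typed in as a vector; the names lower_fence and upper_fence are made up for this example.

```r
marks <- c(15, 17, 15, 18, 23, 21, 18, 25, 23, 19, 21, 17, 17, 24, 8)

# First and third quartiles and the interquartile range.
q   <- quantile(marks, probs = c(0.25, 0.75))
iqr <- IQR(marks)

# Limits beyond which a point counts as an outlier.
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

outliers <- marks[marks < lower_fence | marks > upper_fence]
print(outliers)

# boxplot.stats() applies the same rule and reports the
# points that boxplot() would draw individually.
print(boxplot.stats(marks)$out)
```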
Program-1
marks <- read.csv(file="/home/vnd/Documents/marks.csv", header=TRUE, sep=",")
class(marks)      # class of the loaded object (a data frame)
names(marks)      # to show column names
dim(marks)        # to show dimensions
str(marks)        # internal structure of the dataset
View(marks)       # to view the dataset
x <- marks$marks  # numeric vector holding the marks column
x
min(x)    # minimum value of the marks
max(x)    # maximum value of the marks
mean(x)   # mean value of the marks
range(x)  # range (from, to) of the marks
sd(x)     # standard deviation
var(x)    # variance
#HISTOGRAM:
hist(x)
h <- hist(x, main="Insemester exam marks of 1-20 roll no", xlab="Marks", col=c("red","blue"))
#to display the details of histogram
h
h <- hist(x, main="Insemester exam marks of 1-20 roll no", xlab="Marks", col=c("red","blue"), freq=FALSE)
h <- hist(x, breaks=10, main="Insemester exam marks of 1-20 roll no", xlab="Marks", col=c("red","blue"))
#BOXPLOT:
boxplot(x)
boxplot(x, horizontal=TRUE)
boxplot(x, horizontal=TRUE, main="insem marks", xlab="marks", col="red")
summary(x)
IQR(x)
Dataset:
Roll no | Marks
1 | 15
2 | 17
3 | 15
4 | 18
5 | 23
6 | 21
7 | 18
8 | 25
9 | 23
10 | 19
11 | 21
12 | 17
13 | 17
14 | 24
15 | 8
Program-02
library(datasets)  # to select the library
data("iris")       # to load a particular dataset from the library
names(iris)        # to show column names
dim(iris)          # to show dimensions
View(iris)         # to view the dataset
str(iris)          # internal structure of the dataset
min(iris$Sepal.Length)    # minimum value in the Sepal.Length column
max(iris$Sepal.Width)     # maximum value in the Sepal.Width column
mean(iris$Sepal.Length)   # mean value in the Sepal.Length column
range(iris$Sepal.Length)  # range (from, to) in the Sepal.Length column
sd(iris$Sepal.Length)     # standard deviation
var(iris$Sepal.Length)    # variance
#using hist function
h <- hist(iris$Sepal.Length, main="sepal length frequencies-histogram",
          xlab="sepal length", xlim=c(3.5,8.5), col="blue")
#to display the details of histogram
h
#using breaks and las
h <- hist(iris$Sepal.Length, main="sepal length frequencies- histogram",
          xlab="sepal length", col="red", labels=TRUE, breaks=20,
          border="green", las=3)
#Write breaks in the following way: a single number such as breaks=12 is only
#a suggestion to R, so for fine details you need to specify the vector of break points
h <- hist(iris$Sepal.Length,
          breaks=c(4.3, 4.6, 4.9, 5.2, 5.5, 5.8, 6.1, 6.4, 6.7, 7.0, 7.3, 7.6, 7.9))
#using boxplot() function
boxplot(iris$Sepal.Length)
#this will display the summary: the quartiles, median, min, max, ...
summary(iris$Sepal.Length)
#combined boxplot for all features; -5 because we are omitting the 5th column (Species), which is not numeric
myboxplot <- boxplot(iris[,-5])
boxplot(iris$Sepal.Length, horizontal=TRUE)
boxplot(iris$Petal.Length, horizontal=FALSE)
boxplot(iris$Petal.Width, horizontal=TRUE, main="petal width", xlab="petal width", col="red")
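The program above works mostly on Sepal.Length, while the experiment asks for statistics and a histogram for each feature. One possible way to cover all four numeric features at once is sketched below; the helper names features and stats, the 2 x 2 layout and the colour are choices made for this illustration.

```r
data("iris")

# Keep only the numeric features (drops the Species column).
features <- iris[, sapply(iris, is.numeric)]

# Summary statistics for every feature: one column per feature.
stats <- sapply(features, function(col) {
  c(min = min(col), max = max(col), mean = mean(col),
    sd = sd(col), var = var(col),
    quantile(col, probs = c(0.25, 0.50, 0.75)))
})
print(round(stats, 3))

# One histogram per feature, arranged in a 2 x 2 grid.
old_par <- par(mfrow = c(2, 2))
for (name in names(features)) {
  hist(features[[name]], main = name, xlab = name, col = "lightblue")
}
par(old_par)  # restore the previous layout

# All boxplots combined into a single plot.
boxplot(features, main = "Iris features")
```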
How to Install ‘R’ On Ubuntu
# The following commands need to be run to install R
sudo apt-get update
sudo apt-get install r-base
# Download and install RStudio
sudo apt-get install gdebi-core
wget https://download1.rstudio.org/rstudio-1.0.44-amd64.deb
sudo gdebi rstudio-1.0.44-amd64.deb
rm rstudio-1.0.44-amd64.deb
# If you get an error like "apt-package not found", use the following steps as a solution.
sudo update-alternatives --config python3
Then select alternative 1.
Practice Program On 'R'
Load the USArrests dataset into a data frame. Use R and perform the following:
· How many features are there and what are their types (e.g., numeric, nominal)?
· Compute and display summary statistics for each feature available in the dataset (e.g., minimum value, maximum value, mean, range, standard deviation, variance and percentiles).
· Data Visualization: Create a histogram for each feature in the dataset to illustrate the feature distributions. Plot each histogram.
· Create a boxplot for each feature in the dataset. All of the boxplots should be combined into a single plot. Compare distributions and identify outliers.
USArrests
This data set contains statistics about violent crime rates by US state.
data("USArrests")
head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
- Murder: Murder arrests (per 100,000)
- Assault: Assault arrests (per 100,000)
- UrbanPop: Percent urban population
- Rape: Rape arrests (per 100,000)
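As a starting point for the practice program, the sketch below shows one possible way to approach the four tasks on USArrests. It is an outline under assumptions, not a model answer: the variable names (types, spread, outliers) and plot settings are made up for this example.

```r
data("USArrests")

# Task 1: number of features and their types.
print(dim(USArrests))
types <- sapply(USArrests, class)
print(types)

# Task 2: summary statistics (min, quartiles, mean, max),
# plus standard deviation and variance per feature.
print(summary(USArrests))
spread <- sapply(USArrests, function(col) c(sd = sd(col), var = var(col)))
print(spread)

# Task 3: one histogram per feature.
old_par <- par(mfrow = c(2, 2))
for (name in names(USArrests)) {
  hist(USArrests[[name]], main = name, xlab = name)
}
par(old_par)

# Task 4: combined boxplot and the outliers of each feature.
boxplot(USArrests, main = "USArrests features")
outliers <- lapply(USArrests, function(col) boxplot.stats(col)$out)
print(outliers)
```

Because Assault is on a much larger scale than the other three columns, the combined boxplot is dominated by it; comparing the features separately may be easier to read.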
Conclusion: