Experiment No: DA-(1,2)

TITLE: Download the Iris flower dataset (or any other dataset) into a data frame. Use R to perform the following:

· How many features are there and what are their types (e.g., numeric, nominal)?

· Compute and display summary statistics for each feature available in the dataset (e.g., minimum value, maximum value, mean, range, standard deviation, variance and percentiles).

· Data visualization: create a histogram for each feature in the dataset to illustrate the feature distributions. Plot each histogram.

· Create a boxplot for each feature in the dataset. All of the boxplots should be combined into a single plot. Compare distributions and identify outliers.

Objective:

· To study the R programming language and statistical computing.

Requirements (Hw/Sw): PC, RStudio, Ubuntu system, R.

Theory:-

R is a programming language and software environment for statistical analysis, graphics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Core Team. R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems such as Linux, Windows and Mac. The language was named R partly after the first letter of the first names of its two authors (Robert Gentleman and Ross Ihaka) and partly as a play on the name of the Bell Labs language S.

Features of R

The following are the important features of R −

· R is a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

· R has an effective data handling and storage facility.

· R provides a suite of operators for calculations on arrays, lists, vectors and matrices.

· R provides a large, coherent and integrated collection of tools for data analysis.

· R provides graphical facilities for data analysis and display, either directly on screen or printed on paper.

In short, R is one of the most widely used programming languages for statistics.

R Command Prompt

Once the R environment is set up, you can start the R command prompt by typing the following command at your shell prompt −

$ R

This launches the R interpreter and gives you a prompt > where you can start typing your program as follows −

> myString <- "Hello, World!"

> print(myString)

[1] "Hello, World!"

Here the first statement defines a string variable myString and assigns it the string "Hello, World!". The next statement uses print() to print the value stored in myString.

R Script File

Usually, you will write your programs in script files and then execute those scripts at the command prompt with the R script interpreter, Rscript. Start by writing the following code in a text file called test.R −

# My first program in R Programming

myString <- "Hello, World!"

print(myString)

Save the above code in a file named test.R and run it from the Linux command prompt as shown below. The syntax is the same on Windows and other systems.

$ Rscript test.R

When we run the above program, it produces the following result.

[1] "Hello, World!"

Comments

Comments are explanatory text in your R program; the interpreter ignores them when executing the program. A single-line comment is written with # at the beginning of the statement as follows −

# My first program in R Programming

R does not support multi-line comments, but you can get a similar effect with the following trick −

if(FALSE) {

"This is a demo for multi-line comments and it should be put inside either a

single OR double quote"

}

myString <- "Hello, World!"

print(myString)

[1] "Hello, World!"

The block inside if(FALSE) is parsed by the R interpreter but never executed, so it does not interfere with your actual program. The comment text must be enclosed in single or double quotes so that it parses as a valid string.

In contrast to programming languages like C and Java, variables in R are not declared with a data type. A variable is assigned an R-object, and the data type of the R-object becomes the data type of the variable. There are many types of R-objects. The frequently used ones are −

· Vectors

· Lists

· Matrices

· Arrays

· Factors

· Data Frames

The simplest of these objects is the vector. There are six data types of atomic vectors, also termed the six classes of vectors: logical, numeric, integer, complex, character and raw. The other R-objects are built upon the atomic vectors.
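As a small illustration of the six atomic vector classes, the sketch below creates one value of each and asks R for its class. The sample values and the variable names (examples, classes) are arbitrary choices made for this sketch.

```r
# One example value per atomic vector class (values are arbitrary).
examples <- list(
  logical   = TRUE,
  numeric   = 23.5,
  integer   = 2L,
  complex   = 2 + 5i,
  character = "TRUE",
  raw       = charToRaw("Hello")
)

# class() reports the data type that each variable took from its R-object.
classes <- sapply(examples, class)
print(classes)
```

Note that "TRUE" in quotes is a character string, not a logical value: the type comes from the object assigned, not from a declaration.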

What is statistical analysis?

Statistical analysis is the science of collecting, exploring and presenting large amounts of data to discover underlying patterns and trends. Statistics are applied every day in research, industry and government to make decisions on a more scientific basis.

When census data cannot be collected, statisticians collect data by developing specific experiment designs and survey samples. Representative sampling assures that inferences and conclusions can reasonably extend from the sample to the population as a whole.

Two main statistical methods are used in data analysis: descriptive statistics, which summarize data from a sample using indexes such as the mean or standard deviation, and inferential statistics, which draw conclusions from data that are subject to random variation (e.g., observational errors, sampling variation).

Data Mining Vs. Statistics

Statistics is essentially a part of the process of data mining. It is the science of learning from data, and it provides tools and techniques for dealing with large amounts of data. Data mining and statistics are both part of the process of learning from data and analysing it: both discover and analyse structures in data with the aim of transforming it into information. Although their aims are similar, their approaches differ. Data mining approaches can be applied to both numeric and non-numeric data, whereas classical statistical measures such as the mean and variance apply mainly to numeric data.

How to Load Libraries and a Built-in Dataset in R

library(datasets) # to load the library

data("iris") # to load a particular dataset from the library

names(iris) # to show column names

dim(iris) # to show dimensions

View(iris) # to view dataset

str(iris) # internal structure of dataset

How to Load a Dataset from the Host Machine

mydata <- read.csv(file = "/home/vnd/Desktop/dibetes diabetes.csv", header = TRUE, sep = ",")
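The read.csv() call above depends on a file that exists only on the author's machine. The following is a self-contained sketch of the same call, assuming a tiny hypothetical CSV (columns rollno and marks) written to a temporary file first so that the example runs anywhere.

```r
# Write a small, made-up CSV to a temporary file (hypothetical sample data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("rollno,marks", "1,15", "2,17", "3,15"), csv_path)

# header = TRUE takes column names from the first line; sep = "," is the delimiter.
mydata <- read.csv(file = csv_path, header = TRUE, sep = ",")

str(mydata)   # a data frame with 3 rows and 2 columns
print(mydata)
```

With header = TRUE the first line supplies the column names, which is why a column can later be selected as mydata$marks.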

R-BOXPLOT

A boxplot summarizes how the values in a data set are distributed. It is built from the three quartiles, which divide the data into four parts, and shows the minimum, first quartile, median, third quartile and maximum of the data set. Drawing a boxplot for each of several data sets also makes it easy to compare their distributions.

Boxplots are created in R by using the boxplot() function.

Syntax

The basic syntax to create a boxplot in R is −

boxplot(x, data, notch, varwidth, names, main)

Following is the description of the parameters used −

· x is a vector or a formula.

· data is the data frame.

· notch is a logical value. Set as TRUE to draw a notch.

· varwidth is a logical value. Set as TRUE to draw the width of each box proportional to the square root of the sample size.

· names are the group labels which will be printed under each boxplot.

· main is used to give a title to the graph.
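As a minimal sketch of these parameters in use, the following draws one box per species for the built-in iris data. The choice of Sepal.Length, the group labels and the title are illustrative assumptions rather than part of the assignment.

```r
# Built-in iris data; the formula form draws one box per Species.
data("iris")

b <- boxplot(Sepal.Length ~ Species, data = iris,
             notch = TRUE,     # draw a notch around each median
             varwidth = TRUE,  # box width follows the group size
             names = c("Setosa", "Versicolor", "Virginica"),
             main = "Sepal length by species")

# boxplot() invisibly returns the statistics it plotted.
print(b$stats)  # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
print(b$out)    # points drawn individually as outliers
```

Keeping the return value in b is optional; it is useful when the plotted quartiles or outliers are needed as numbers.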

Axes and Text

Many high level plotting functions (plot, hist, boxplot, etc.) allow you to include axis and text options (as well as other graphical parameters).

For example

# Specify axis options within plot()

plot(x,y,main="title",sub="subtitle",xlab="X-axis label",ylab="Y-axis label",
xlim=c(xmin, xmax), ylim=c(ymin, ymax))
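The template above uses placeholders (x, y, xmin and so on). A runnable version, assuming two iris columns as the data and hand-picked axis limits, might look like this:

```r
data("iris")
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Titles, labels and limits are illustrative choices for this sketch.
plot(x, y,
     main = "Iris measurements",
     sub  = "Built-in iris dataset",
     xlab = "Sepal length (cm)",
     ylab = "Petal length (cm)",
     xlim = c(4, 8),   # slightly wider than the observed sepal lengths
     ylim = c(0, 7))   # slightly wider than the observed petal lengths
```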

R - Histograms

A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is similar to a bar chart, but it groups the values into continuous ranges. The height of each bar represents the number of values that fall in that range.

R creates histograms using the hist() function. This function takes a vector as input and accepts further parameters that control the plot.

Syntax

The basic syntax for creating a histogram using R is −

hist(v,main,xlab,xlim,ylim,breaks,col,border)

Following is the description of the parameters used −

· v is a vector containing numeric values used in histogram.

· main indicates title of the chart.

· col is used to set color of the bars.

· border is used to set border color of each bar.

· xlab is used to give description of x-axis.

· xlim is used to specify the range of values on the x-axis.

· ylim is used to specify the range of values on the y-axis.

· breaks controls the binning: a single number suggests how many bins to use, and a vector gives the exact break points between bins.

In a histogram, the total range of the data set (i.e., from the minimum value to the maximum value) is divided into equal parts; 8 to 15 parts is a common rule of thumb. These equal parts are known as bins or class intervals.

Each and every observation (or value) in the data set is placed in the appropriate bin. The number of observations occupying a given bin, becomes the frequency of that bin.

Note that no overlap is allowed between the bins. Any observation can occupy one and only one bin.

hist() returns an object of class histogram, which has:

· breaks-places where the breaks occur,

· counts-the number of observations falling in that cell,

· density-the density of cells,

· mids-the midpoints of cells,

· xname-the x argument name and

· equidist-a logical value indicating if the breaks are equally spaced or not.
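To see these components, the returned object can be stored and inspected. The sketch below uses the sample vector from this section's example and passes plot = FALSE so that nothing is drawn; relying on the default binning is an assumption of this sketch.

```r
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# plot = FALSE computes the bins without opening a graphics device.
h <- hist(v, plot = FALSE)

print(class(h))    # the object's class
print(h$breaks)    # bin boundaries
print(h$counts)    # observations per bin
print(h$density)   # counts scaled so that the bar areas sum to 1
print(h$mids)      # bin midpoints
print(h$equidist)  # whether all bins have the same width
```

Every observation lands in exactly one bin, so sum(h$counts) equals length(v).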

Example

A simple histogram is created using input vector, label, col and border parameters.

The script given below will create and save the histogram in the current R working directory.

# Create data for the graph.

v <- c(9,13,21,8,36,22,12,41,31,33,19)

# Give the chart file a name.

png(file = "histogram.png")

# Create the histogram.

hist(v,xlab = "Weight",col = "yellow",border = "blue")

# Save the file.

dev.off()

When we execute the above code, it saves the histogram as histogram.png in the current working directory.

Outlier In Boxplot

An outlier is an observation that is numerically distant from the rest of the data. When reviewing a boxplot, an outlier is a data point located outside the fences (the ends of the “whiskers”) of the boxplot, i.e., more than 1.5 times the interquartile range above the upper quartile or below the lower quartile.
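The 1.5 × IQR rule can be checked by hand. The sketch below applies it to the fifteen marks listed in this document's Dataset section, typed in directly so that it runs without the CSV file; the variable names (lower_fence, upper_fence, outliers) are chosen for this sketch.

```r
marks <- c(15, 17, 15, 18, 23, 21, 18, 25, 23, 19, 21, 17, 17, 24, 8)

# Quartiles and interquartile range.
q   <- quantile(marks, probs = c(0.25, 0.75))
iqr <- IQR(marks)

# Fences: 1.5 * IQR beyond the lower and upper quartiles.
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Any value outside the fences is flagged as an outlier.
outliers <- marks[marks < lower_fence | marks > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)

# boxplot.stats() applies the same rule, based on hinges, which is what boxplot() draws.
print(boxplot.stats(marks)$out)
```

For these marks the quartiles are 17 and 22, so the fences fall at 9.5 and 29.5 and the single mark of 8 is the only outlier.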

Program-1

marks <- read.csv(file = "/home/vnd/Documents/marks.csv", header = TRUE, sep = ",")

class(marks) # class of the loaded object (data.frame)

names(marks) # to show column names

dim(marks) # to show dimensions

str(marks) # internal structure of dataset

View(marks) # to view dataset

x <- marks$marks # numeric vector holding the marks column

min(x) # minimum value of marks

max(x) # maximum value of marks

mean(x) # mean value of marks

range(x) # range (from, to) of marks

sd(x) # standard deviation

var(x) # variance

quantile(x) # percentiles (0%, 25%, 50%, 75%, 100%)

#HISTOGRAM:

#to display the details of histogram

hist(x)

h<-hist(x,main="Insemester exam marks of 1-20 roll no",xlab="Marks",col=c("red","blue"))

h<-hist(x,main="Insemester exam marks of 1-20 roll no",xlab="Marks",col=c("red","blue"),freq=F)

h<-hist(x,breaks=10,main="Insemester exam marks of 1-20 roll no",xlab="Marks",col=c("red","blue"))

#BOXPLOT:

boxplot(x)

boxplot(x,horizontal=T)

boxplot(x,horizontal=T,main="insem marks",xlab="marks",col=c("red"))

summary(x)

IQR(x)

Dataset:

1	15
2	17
3	15
4	18
5	23
6	21
7	18
8	25
9	23
10	19
11	21
12	17
13	17
14	24
15	8

Program-02

library(datasets) # to load the library

data("iris") # to load a particular dataset from the library

names(iris) # to show column names

dim(iris) # to show dimensions

View(iris) # to view dataset

str(iris) # internal structure of dataset

min(iris$Sepal.Length) # minimum value in Sepal.Length column

max(iris$Sepal.Width) # maximum value in Sepal.Width column

mean(iris$Sepal.Length) # mean value in Sepal.Length column

range(iris$Sepal.Length) # range (from, to) in Sepal.Length column

sd(iris$Sepal.Length) # standard deviation

var(iris$Sepal.Length) # variance

#using hist function

h<-hist(iris$Sepal.Length, main="sepal length frequencies-histogram", xlab="sepal length", xlim=c(3.5,8.5), col="blue")

#to display the details of histogram

h

#using breaks and las

h<-hist(iris$Sepal.Length, main="sepal length frequencies-histogram", xlab="sepal length", col="red", labels=TRUE, breaks=20, border="green", las=3)

#Specify breaks as a vector when fine control is needed: a single number such as breaks=12 is only a suggestion, and R may choose a different number of bins

h<-hist(iris$Sepal.Length, breaks=c(4.3, 4.6, 4.9, 5.2, 5.5, 5.8, 6.1, 6.4, 6.7, 7.0, 7.3, 7.6, 7.9))

#using boxplot() function

boxplot(iris$Sepal.Length)

#this will display the summary: the quartiles, median, min, max, ...

summary(iris$Sepal.Length)

#combined boxplot for all features; -5 omits the 5th column (Species), which is not numeric

myboxplot<-boxplot(iris[,-5])

boxplot(iris$Sepal.Length,horizontal=T)

boxplot(iris$Petal.Length,horizontal=F)

boxplot(iris$Petal.Width,horizontal=T,main="petal width",xlab="petal width",col=c("red"))
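Program-02 computes the statistics and the histogram for Sepal.Length only, while the problem statement asks for every feature. One possible way to cover all numeric columns at once is sketched below; the helper names (num, stats) and the 2 x 2 plot layout are choices made for this sketch.

```r
data("iris")

# Keep only the numeric features (drops the Species factor).
num <- iris[sapply(iris, is.numeric)]

# One column of summary statistics per feature.
stats <- sapply(num, function(col) {
  c(min = min(col), max = max(col), mean = mean(col),
    sd = sd(col), var = var(col),
    quantile(col, probs = c(0.25, 0.50, 0.75)))
})
print(round(stats, 3))

# One histogram per feature, arranged in a 2 x 2 grid.
old_par <- par(mfrow = c(2, 2))
for (nm in names(num)) {
  hist(num[[nm]], main = nm, xlab = nm, col = "lightblue")
}
par(old_par)  # restore the previous layout
```

sapply() returns a matrix here because every call yields a vector of the same length, so a single value can be read as stats["mean", "Petal.Width"].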

How to Install ‘R’ On Ubuntu

# Run the following commands to install R

sudo apt-get update

sudo apt-get install r-base

# Download and install RStudio

sudo apt-get install gdebi-core

wget https://download1.rstudio.org/rstudio-1.0.44-amd64.deb

sudo gdebi rstudio-1.0.44-amd64.deb

rm rstudio-1.0.44-amd64.deb

# If you get an error like "apt-package not found", run the following command and select alternative 1

sudo update-alternatives --config python3

Practice Program On 'R'

Load the USArrests dataset into a data frame. Use R to perform the following:

· How many features are there and what are their types (e.g., numeric, nominal)?

· Compute and display summary statistics for each feature available in the dataset (e.g., minimum value, maximum value, mean, range, standard deviation, variance and percentiles).

· Data visualization: create a histogram for each feature in the dataset to illustrate the feature distributions. Plot each histogram.

· Create a boxplot for each feature in the dataset. All of the boxplots should be combined into a single plot. Compare distributions and identify outliers.

USArrests

This data set contains statistics about violent crime rates by US state.

data("USArrests")
     
head(USArrests)

           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

Murder: Murder arrests (per 100,000)
Assault: Assault arrests (per 100,000)
UrbanPop: Percent urban population
Rape: Rape arrests (per 100,000)
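A possible starting point for the practice program is sketched below. It is only one way to approach the exercise; the variable names (types, b) and the plot title are assumptions of this sketch.

```r
data("USArrests")

# Number of features and their types.
print(dim(USArrests))
types <- sapply(USArrests, class)
print(types)

# Summary statistics for every feature.
print(summary(USArrests))
print(sapply(USArrests, sd))
print(sapply(USArrests, var))

# Combined boxplot; Assault is on a much larger scale than the other features.
b <- boxplot(USArrests, main = "USArrests features")
print(b$out)  # values drawn as outliers
```

Because the features are on different scales, a separate boxplot per feature, or a plot of scale(USArrests), may make the comparison easier to read.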

Conclusion:
