Introduction to R

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code. The purpose of this notebook is to enable you to play around with basic functionalities of R.

The page you see now is a rendered HTML file. If you want to experiment with the .Rmd file, go to the original repository, download the .Rmd file, and run it in RStudio.

Data Structures of R

first_variable = 'This is my first R variable.'
print(first_variable)

## [1] "This is my first R variable."

R is a powerful analytical language that is simple to use. It is not without its quirks, however, so we will look at its data types and basic functionality first.

Scalar

A scalar is a variable with a single value assigned to it. R has no dedicated scalar type; a single value is simply a vector of length 1.

There are various types in R such as character, numeric (float), integer etc. Let’s create example variables to demonstrate.

# An integer (the L suffix makes it an integer; a plain 709 would be stored as a double). Comments start with #
lecture_code = 709L

# A character variable
lecture_name = 'Introduction to Data Science'

# A numeric (double) value
random_float = 1.2
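
To check what type a value has, class() and typeof() can be used. The short sketch below uses illustrative variable names (int_value, dbl_value, chr_value) that are not part of the notebook; it also shows that a single value is just a vector of length 1.

```r
# Inspect the type and length of single values
int_value <- 709L      # integer (note the L suffix)
dbl_value <- 709       # numeric, stored as a double
chr_value <- "Introduction to Data Science"

class(int_value)       # "integer"
class(dbl_value)       # "numeric"
typeof(dbl_value)      # "double"
class(chr_value)       # "character"

# A single value is a vector of length 1
length(dbl_value)      # 1
is.vector(dbl_value)   # TRUE
```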

Vector

Vector is an R structure that holds values of the same type (e.g. numeric, character). Vectors are 1-dimensional data structures. The only dimension they have is called “length”.

There are several ways to create a vector and assign it to a variable. The most common is the c() function:

# A numeric vector
dept_lecture_code = c(901, 580)

# A character vector
dept_lecture_name = c("Information Systems", "Introduction to Data Science")

You can also use the assign() function to create vectors:

assign("lecture_prof_assistant",c("Tugba Taşkaya Temizel","Mehmet Ali Akyol"))

You can concatenate vectors and assign to a new vector:

codes_names_vector = c(dept_lecture_code, dept_lecture_name, lecture_prof_assistant)

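As you can see, the c() function stands for ‘combine’. It can combine explicit values or different vectors together to form a new vector. Because a vector holds values of a single type, combining numeric and character vectors, as above, coerces every element to character.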
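
A minimal sketch of what happens to the types when numeric and character vectors are combined. The variable names (codes, names_chr, combined) are illustrative only.

```r
codes <- c(901, 580)
names_chr <- c("Information Systems", "Introduction to Data Science")

# Combining numeric and character values coerces everything to character
combined <- c(codes, names_chr)

class(codes)      # "numeric"
class(combined)   # "character"
combined[1]       # "901" (now a string, not a number)
```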

Generating Sequences and Assigning to Vectors

Sequences are numerical values that follow a certain rule, such as the numbers from 1 to 30. In R, there are two ways to generate sequences: (1) the : operator and (2) the seq() function.

The : operator generates a sequence of numbers with a step of 1. For instance, 1:30 is the vector c(1,2,…,30). The colon operator has higher precedence than arithmetic operators such as *, so 2*1:15 is evaluated as 2*(1:15) and gives the vector c(2,4,…,28,30). 30:1 generates the sequence backwards.

#generating sequence with : operator
one_to_thirty=1:30
print(one_to_thirty)

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30

#the priority of : operator
even_numbers_to_thirty=2*1:15
print(even_numbers_to_thirty)

##  [1]  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30

#backward sequence
backward_sequence=30:1
print(backward_sequence)

##  [1] 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8
## [24]  7  6  5  4  3  2  1

The seq() function is a more general way to create sequences. It has 5 arguments, only some of which may be specified in any one call. The first two arguments, if given, specify the beginning and end of the sequence. If these are the only two arguments given, the result is the same as with the colon operator.

#generating sequence with seq command
seq(1,30)

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30

The first two arguments can be named from=value and to=value. See the examples below:

#generating sequence with named arguments in seq command
seq(from=1,to=30)

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30

seq(to=30,from=1)

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30

The next two arguments to seq() may be named by=value and length=value (short for length.out, which R accepts through partial argument matching). They specify the step size and the length of the sequence respectively. If neither of these arguments is given, the default by=1 is assumed.

#by and length arguments of seq command
seq(from=1,to=30,by=2)

##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29

seq(from=1,length=50,by=.2)

##  [1]  1.0  1.2  1.4  1.6  1.8  2.0  2.2  2.4  2.6  2.8  3.0  3.2  3.4  3.6
## [15]  3.8  4.0  4.2  4.4  4.6  4.8  5.0  5.2  5.4  5.6  5.8  6.0  6.2  6.4
## [29]  6.6  6.8  7.0  7.2  7.4  7.6  7.8  8.0  8.2  8.4  8.6  8.8  9.0  9.2
## [43]  9.4  9.6  9.8 10.0 10.2 10.4 10.6 10.8

The fifth argument is named along=vector (short for along.with), which is normally used as the only argument to create the sequence 1,2,…,length(vector), or the empty sequence if the vector is empty.

#generates sequence from 1 to 30 as the length of backward_sequence is equal to 30
seq(from=1, along=backward_sequence)

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30
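
Base R also provides seq_along() and seq_len() as shortcuts for this kind of index sequence. The sketch below, with illustrative names (values, empty), shows why they are often preferred over 1:length(x) when the vector might be empty.

```r
values <- c(10, 20, 30)
empty <- numeric(0)

seq_along(values)   # 1 2 3
seq_len(3)          # 1 2 3

# With an empty vector, seq_along() gives an empty sequence ...
seq_along(empty)    # integer(0)

# ... whereas 1:length(empty) counts backwards from 1 to 0
1:length(empty)     # 1 0
```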

A related function is rep() which can be used for replicating an object in various ways.

#replicate dept_lecture_code vector 5 times
rep(dept_lecture_code,times=5)

##  [1] 901 580 901 580 901 580 901 580 901 580

#replicate each element of dept_lecture_code vector before moving on to the next 
rep(dept_lecture_code,each=5)

##  [1] 901 901 901 901 901 580 580 580 580 580

Logical Vectors

As well as numerical vectors, R allows manipulation of logical quantities. The elements of a logical vector can have the values TRUE, FALSE, or NA (for “not available”, see Missing Values). The first two are often abbreviated as T and F, respectively. Note that T and F are just variables which are set to TRUE and FALSE by default; they are not reserved words and hence can be overwritten by the user. You should therefore always use TRUE and FALSE.

Logical vectors are generated by conditions.

#generating logical vector by condition
logical_vector=one_to_thirty>13
print(logical_vector)

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [23]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

As you can see, logical_vector is a vector with the same length as one_to_thirty with values FALSE corresponding to the elements of one_to_thirty where the condition is not met and TRUE where it is.

The comparison operators are <, <=, >, >=, == for exact equality and != for inequality. In addition, if c1 and c2 are logical expressions, then c1 & c2 is their intersection (“and”), c1 | c2 is their union (“or”), and !c1 is the negation of c1.

Logical vectors may be used in ordinary arithmetic, in which case they are coerced into numeric vectors, FALSE becoming 0 and TRUE becoming 1. However there are situations where logical vectors and their coerced numeric counterparts are not equivalent, for example see the next subsection.
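
As a small sketch of this coercion, summing a logical vector counts its TRUE values and taking its mean gives the proportion of TRUE values. The names above_13 and in_range are illustrative.

```r
one_to_thirty <- 1:30
above_13 <- one_to_thirty > 13

sum(above_13)    # number of elements greater than 13: 17
mean(above_13)   # proportion of elements greater than 13

# Conditions can be combined element-wise with & and |
in_range <- one_to_thirty > 13 & one_to_thirty <= 20
one_to_thirty[in_range]   # 14 15 16 17 18 19 20
```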

Missing Values

In some cases the components of a vector may not be completely known. When an element or value is “not available” or a “missing value” in the statistical sense, a place within a vector may be reserved for it by assigning it the special value NA. In general, any operation on an NA becomes an NA. The motivation for this rule is simply that if the specification of an operation is incomplete, the result cannot be known and hence is not available.

The function is.na(x) gives a logical vector of the same size as x with value TRUE if and only if the corresponding element in x is NA.

x=c(1:3,NA)
ind=is.na(x)
print(ind)

## [1] FALSE FALSE FALSE  TRUE

Notice that the logical expression x == NA is quite different from is.na(x) since NA is not really a value but a marker for a quantity that is not available. Thus x == NA is a vector of the same length as x all of whose values are NA as the logical expression itself is incomplete and hence undecidable.
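
The difference can be seen directly in a short sketch. The na.rm argument shown at the end is the usual way to ask summary functions such as mean() to skip missing values.

```r
x <- c(1:3, NA)

x == NA          # NA NA NA NA  (the comparison itself is undecidable)
is.na(x)         # FALSE FALSE FALSE TRUE
sum(is.na(x))    # 1 missing value

mean(x)                 # NA, because one element is unknown
mean(x, na.rm = TRUE)   # 2, computed from the non-missing elements
```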

Note that there is a second kind of “missing” value, produced by numerical computation: the so-called Not a Number, NaN, values. Examples are given below. They give NaN since the result cannot be defined.

0/0

## [1] NaN

#or
Inf-Inf

## [1] NaN

In summary, is.na(x) is TRUE both for NA and NaN values. To differentiate these, is.nan(x) is only TRUE for NaNs.
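
A minimal check of this distinction on a vector that contains both kinds of missing value (the name v is illustrative):

```r
v <- c(1, NA, NaN)

is.na(v)    # FALSE  TRUE  TRUE
is.nan(v)   # FALSE FALSE  TRUE
```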

Missing values are sometimes printed as <NA> when character vectors are printed without quotes.

Character Vectors

Character quantities and character vectors are used frequently in R, for example as plot labels. Where needed they are denoted by a sequence of characters delimited by the double quote character, e.g., "x-values", "New iteration results".

Character strings are entered using either matching double (") or single (') quotes, but are printed using double quotes (or sometimes without quotes). They use C-style escape sequences, using \ as the escape character, so \\ is entered and printed as \\, and inside double quotes " is entered as \". Other useful escape sequences are \n, newline, \t, tab and \b, backspace; see ?Quotes for a full list.

Character vectors may be concatenated into a vector by the c() function; examples of their use will emerge frequently.

The paste() function takes an arbitrary number of arguments and concatenates them one by one into character strings. Any numbers given among the arguments are coerced into character strings in the evident way, that is, in the same way they would be if they were printed. The arguments are by default separated in the result by a single blank character, but this can be changed by the named argument, sep=string, which changes it to string, possibly empty.

labs=paste(c("X","Y"),1:10,sep="")
print(labs)

##  [1] "X1"  "Y2"  "X3"  "Y4"  "X5"  "Y6"  "X7"  "Y8"  "X9"  "Y10"

Note particularly that recycling of shorter vectors takes place here; thus c("X", "Y") is repeated 5 times to match the sequence 1:10.
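
A few more variations may help. The sketch below shows the default separator, a custom sep, the paste0() shortcut for sep="", and the collapse argument, which joins the elements of the result into a single string.

```r
paste("Lecture", 709)                     # "Lecture 709"
paste("Lecture", 709, sep = "-")          # "Lecture-709"
paste0("X", 1:3)                          # "X1" "X2" "X3"
paste(c("a", "b", "c"), collapse = "+")   # "a+b+c"
```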

Index Vectors

Subsets of the elements of a vector may be selected by appending to the name of the vector an index vector in square brackets. More generally any expression that evaluates to a vector may have subsets of its elements similarly selected by appending an index vector in square brackets immediately after the expression.

Such index vectors can be any of four distinct types.

A logical vector: In this case the index vector is recycled to the same length as the vector from which elements are to be selected. Values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted.

#a vector containing non-missing values of one_to_thirty
y=one_to_thirty[!is.na(one_to_thirty)]

Note that if one_to_thirty has missing values, y will be shorter than one_to_thirty.

#the values of one_to_thirty+1 at the positions where one_to_thirty is non-missing and positive
z=(one_to_thirty+1)[(!is.na(one_to_thirty))&one_to_thirty>0]

A vector of positive integral quantities: In this case the values in the index vector must lie in the set {1, 2, …, length(x)}. The corresponding elements of the vector are selected and concatenated, in that order, in the result. The index vector can be of any length and the result is of the same length as the index vector. For example x[6] is the sixth component of x.

#select the first 10 elements
one_to_thirty[1:10]

##  [1]  1  2  3  4  5  6  7  8  9 10

A vector of negative integral quantities: Such an index vector specifies the values to be excluded rather than included.

#discard the first 5 elements
one_to_thirty[-(1:5)]

##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30

A vector of character strings: This possibility only applies where an object has a names attribute to identify its components. In this case a sub-vector of the names vector may be used in the same way as the positive integral indices described above.

fruit=c(5,10,1,20)
names(fruit)=c("orange","banana","apple","peach")
lunch=fruit[c("apple","orange")]
print(lunch)

##  apple orange 
##      1      5

The advantage is that alphanumeric names are often easier to remember than numeric indices. This option is particularly useful in connection with data frames, as we will see later.

An indexed expression can also appear on the receiving end of an assignment, in which case the assignment operation is performed only on those elements of the vector. The expression must be of the form vector[index_vector] as having an arbitrary expression in place of the vector name does not make much sense here.

#replace missing values with zeros
one_to_thirty[is.na(one_to_thirty)]=0
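
Since one_to_thirty has no missing values, the assignment above leaves it unchanged. The sketch below uses a made-up vector (scores) that does contain NAs, so the effect of indexed assignment is visible.

```r
scores <- c(4, NA, 7, NA, 10)

# Replace only the missing elements
scores[is.na(scores)] <- 0
scores                     # 4 0 7 0 10

# Any logical condition works as the index, e.g. cap values at 5
scores[scores > 5] <- 5
scores                     # 4 0 5 0 5
```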

Matrices and Arrays

Arrays are n-dimensional data structures that hold data of the same type. Every array has a dimension (dim) attribute. Matrices are two-dimensional arrays.

#define an array
z=array(1:24,dim=c(3,4,2))
print(z)

## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24

Indexing

Individual elements of an array may be referenced by giving the name of the array followed by the subscripts in square brackets, separated by commas.

More generally, subsections of an array may be specified by giving a sequence of index vectors in place of subscripts; however if any index position is given an empty index vector, then the full range of that subscript is taken.

Continuing the previous example, z[2,,] is a 4x2 array with dimension vector c(4,2).

#values of the z[2,,] array
c(z[2,1,1], z[2,2,1], z[2,3,1], z[2,4,1], z[2,1,2], z[2,2,2], z[2,3,2], z[2,4,2])

## [1]  2  5  8 11 14 17 20 23

z[,,] stands for the entire array which is the same as omitting the subscripts entirely and using z alone.

If an array name is given with just one subscript or index vector, then the corresponding values of the data vector only are used; in this case the dimension vector is ignored. This is not the case, however, if the single index is not a vector but itself an array, as we next discuss.
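
The indexing rules above can be checked on the same 3x4x2 array. This is a small sketch that rebuilds z so that it runs on its own.

```r
z <- array(1:24, dim = c(3, 4, 2))

z[2, 3, 1]       # a single element: 8
dim(z[2, , ])    # empty positions take the full range: 4 2
dim(z[, , 1])    # the first 3x4 slice: 3 4

# A single index ignores the dimensions and addresses the data vector
z[14]            # 14
z[c(1, 24)]      # 1 24
```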

Index Matrices

As well as an index vector in any subscript position, a matrix may be used with a single index matrix in order either to assign a vector of quantities to an irregular collection of elements in the array, or to extract an irregular collection as a vector.

A matrix example makes it clear. In the case of a doubly indexed array, an index matrix may be given consisting of two columns and as many rows as desired. The entries in the index matrix are the row and column indices for the doubly indexed array. Suppose for example we have a 4 by 5 array X and we wish to do the following:

Extract elements X[1,3], X[2,2] and X[3,1] as a vector
Replace those entries with zeros in the array of X

x=array(1:20,dim=c(4,5)) #generate a 4x5 array
x

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20

i=array(c(1:3,3:1),dim=c(3,2)) #i is a 3x2 index array
i

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    2
## [3,]    3    1

#extract those elements from x
x[i]

## [1] 9 6 3

#replace the elements selected with zeros
x[i]=0
x

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    0   13   17
## [2,]    2    0   10   14   18
## [3,]    0    7   11   15   19
## [4,]    4    8   12   16   20

Negative indices are not allowed in index matrices. NA and zero values are allowed: rows in the index matrix containing a zero are ignored, and rows containing an NA produce an NA in the result.

Matrix Multiplication

The operator %*% is used for matrix multiplication. An n by 1 or 1 by n matrix may of course be used as an n-vector if in the context such is appropriate. Conversely, vectors which occur in matrix multiplication expressions are automatically promoted either to row or column vectors, whichever is multiplicatively coherent, if possible.

#A and B are the square matrices of the same size
A=matrix(1:4,nrow=2,ncol=2)
B=matrix(5:8,nrow=2,ncol=2)
#element by element products
A*B

##      [,1] [,2]
## [1,]    5   21
## [2,]   12   32

#matrix product
A%*%B

##      [,1] [,2]
## [1,]   23   31
## [2,]   34   46
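
The automatic promotion of plain vectors to row or column vectors can be sketched as follows. The vector v is an illustrative addition; A is rebuilt so the block runs on its own.

```r
A <- matrix(1:4, nrow = 2, ncol = 2)
v <- c(1, 2)

v %*% A    # v is treated as a 1x2 row vector; the result is 1x2: 5 11
A %*% v    # v is treated as a 2x1 column vector; the result is 2x1: 7 10
v %*% v    # inner product, returned as a 1x1 matrix: 5
```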

The function crossprod() forms “cross products”, meaning that crossprod(A, B) is the same as t(A) %*% B but the operation is more efficient. If the second argument of crossprod() is omitted, it is taken to be the same as the first.

#cross products of A and B with crossprod function
crossprod(A,B)

##      [,1] [,2]
## [1,]   17   23
## [2,]   39   53

The meaning of diag() depends on its argument. diag(v), where v is a vector, gives a diagonal matrix with elements of the vector as the diagonal entries. On the other hand diag(M), where M is a matrix, gives the vector of main diagonal entries of M. Also, somewhat confusingly, if k is a single numeric value then diag(k) is the k by k identity matrix!

#diagonal matrix of a vector
diag(dept_lecture_code)

##      [,1] [,2]
## [1,]  901    0
## [2,]    0  580

#diagonal elements of a matrix
diag(A)

## [1] 1 4

#diag function with numeric value
d=2
diag(d) #produces identity matrix with dimension d

##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

Combining Matrices

In R, cbind() forms matrices by binding together matrices horizontally, or column-wise, and rbind() vertically, or row-wise. The arguments to cbind() must be either vectors of any length, or matrices with the same column size, that is the same number of rows.

#column-wise binding
cbind(A,B)

##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8

#row-wise binding
rbind(A,B)

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## [3,]    5    7
## [4,]    6    8
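
Vectors can be bound to matrices as well. In this sketch the column name total and the variable with_total are illustrative; a named argument to cbind() becomes the column name, and a vector shorter than the matrix is recycled.

```r
A <- matrix(1:4, nrow = 2, ncol = 2)

# Add a column of row sums; the argument name becomes the column name
with_total <- cbind(A, total = rowSums(A))
with_total
dim(with_total)   # 2 3

# A single value is recycled to fill a whole new row
rbind(A, 0)
```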

Lists

Lists are R objects that can contain elements of different types, such as numbers, strings, vectors, and other lists. A list can also contain a matrix or a function as an element.

x <- list("a" = 2.5, "b" = TRUE, "c" = 1:3)
str(x)

## List of 3
##  $ a: num 2.5
##  $ b: logi TRUE
##  $ c: int [1:3] 1 2 3

Accessing elements in a list

Lists can be accessed in a similar fashion to vectors. Integer, logical or character vectors can be used for indexing. The following are some of the different ways to access an element of a list.

x["a"] # gives us a sublist, not the content inside

## $a
## [1] 2.5

x[["a"]] # to retrieve the content

## [1] 2.5

x$a # same as x[["a"]]

## [1] 2.5
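
The difference between single and double brackets shows up in the class of the result. A short sketch, rebuilding x so it runs on its own (the component d is an illustrative addition):

```r
x <- list(a = 2.5, b = TRUE, c = 1:3)

class(x["a"])      # "list": single brackets keep the list wrapper
class(x[["a"]])    # "numeric": double brackets return the content

x[c("a", "c")]     # single brackets can select several components
x[["c"]][2]        # second element of component c: 2

# Components can be added by assignment
x$d <- "new"
names(x)           # "a" "b" "c" "d"
```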

An R list is an object consisting of an ordered collection of objects known as its components.

There is no particular need for the components to be of the same mode or type, and, for example, a list could consist of a numeric vector, a logical value, a matrix, a complex vector, a character array, a function, and so on. Here is a simple example of how to make a list:

#A list
family=list(name="Fred",wife="Mary",noChildren=3,childAges=c(4,7,9))

Components are always numbered and may always be referred to as such. Thus if family is the name of a list with four components, these may be individually referred to as follows:

#Components of a list
family[[1]]

## [1] "Fred"

family[[2]]

## [1] "Mary"

family[[3]]

## [1] 3

family[[4]]

## [1] 4 7 9

family[[4]][[2]] #gives the second element of the 4th component

## [1] 7

length(family) #the number of components of the list

## [1] 4

Components of lists may also be named, and in this case the component may be referred to either by giving the component name as a character string in place of the number in double square brackets, or, more conveniently, by giving an expression of the form name$component_name

#the expressions below give the same result
family$name #calling with component name

## [1] "Fred"

family[[1]] #calling with index

## [1] "Fred"

#calling the vector elements in the list
family$childAges[1]

## [1] 4

family[[4]][[1]]

## [1] 4

Data Frame

A data frame is a list with class “data.frame”. There are restrictions on lists that may be made into data frames, namely:

The components must be vectors (numeric, character, or logical), factors, numeric matrices, lists, or other data frames.
Matrices, lists, and data frames provide as many variables to the new data frame as they have columns, elements, or variables, respectively.
Numeric vectors, logicals and factors are included as is. Before R 4.0.0, character vectors were coerced to factors by default, with the unique values appearing in the vector as levels; from R 4.0.0 onwards they are kept as character vectors unless stringsAsFactors = TRUE is given.
Vector structures appearing as variables of the data frame must all have the same length, and matrix structures must all have the same row size.

A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes. It may be displayed in matrix form, and its rows and columns extracted using matrix indexing conventions.

A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

Creating a data frame from vectors

employee <- c('John Doe','Peter Gynn','Jolie Hope') #vector
salary <- c(21000, 23400, 26800) #vector
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14')) #vector

data = data.frame(employee, salary, startdate) # data frame
data

##     employee salary  startdate
## 1   John Doe  21000 2010-11-01
## 2 Peter Gynn  23400 2008-03-25
## 3 Jolie Hope  26800 2007-03-14

# A data frame
codes_names_df = data.frame(dept_lecture_code, dept_lecture_name)
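
Because a data frame can be indexed like a matrix, rows and columns can be pulled out with the usual bracket notation. A minimal sketch, using a data frame named staff (an illustrative name) built from the same kind of vectors as above:

```r
employee <- c("John Doe", "Peter Gynn", "Jolie Hope")
salary <- c(21000, 23400, 26800)
staff <- data.frame(employee, salary, stringsAsFactors = FALSE)

dim(staff)            # 3 rows, 2 columns
staff[2, ]            # the second row
staff[, "salary"]     # the salary column as a plain vector

# Logical row index combined with a column name
staff[staff$salary > 22000, "employee"]   # "Peter Gynn" "Jolie Hope"
```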

Reading Data from Files

Large data objects will usually be read as values from external files rather than entered during an R session at the keyboard. R input facilities are simple and their requirements are fairly strict and even rather inflexible. There is a clear presumption by the designers of R that you will be able to modify your input files using other tools, such as file editors or Perl, to fit in with the requirements of R. Generally this is very simple.

If variables are to be held mainly in data frames, as we strongly suggest they should be, an entire data frame can be read directly with the read.table() function. There is also a more primitive input function, scan(), that can be called directly.

One of the main ways to read data from files is the built-in read.csv() function, which reads .csv files. Its options let you state whether the first row of the file holds the column names (header) and which character separates the fields (sep).

data = read.csv("cars.csv", header = TRUE, sep = ";")
head(data)

##                         Car    MPG Cylinders Displacement Horsepower
## 1                    STRING DOUBLE       INT       DOUBLE     DOUBLE
## 2 Chevrolet Chevelle Malibu   18.0         8        307.0      130.0
## 3         Buick Skylark 320   15.0         8        350.0      165.0
## 4        Plymouth Satellite   18.0         8        318.0      150.0
## 5             AMC Rebel SST   16.0         8        304.0      150.0
## 6               Ford Torino   17.0         8        302.0      140.0
##   Weight Acceleration Model Origin
## 1 DOUBLE       DOUBLE   INT    CAT
## 2  3504.         12.0    70     US
## 3  3693.         11.5    70     US
## 4  3436.         11.0    70     US
## 5  3433.         12.0    70     US
## 6  3449.         10.5    70     US

For more details on importing data into R and also exporting data, see the R Data Import/Export manual.
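
If you do not have cars.csv at hand, the same call can be tried on a small temporary file. This is only a sketch: the file contents are a made-up three-column excerpt in the style of the cars data, and the names csv_path and cars are illustrative.

```r
# Write a tiny semicolon-separated file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Car;MPG;Cylinders",
             "Ford Torino;17.0;8",
             "AMC Rebel SST;16.0;8"), csv_path)

# Read it back; header = TRUE uses the first row as column names
cars <- read.csv(csv_path, header = TRUE, sep = ";", stringsAsFactors = FALSE)
str(cars)
cars$MPG   # 17 16

unlink(csv_path)   # remove the temporary file
```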

Accessing built-in data sets

Around 100 datasets are supplied with R (in package datasets), and others are available in packages (including the recommended packages supplied with R). To see the list of datasets currently available use data().

# Iris is an existing dataset in R. You can load it directly
data(iris)
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

The str() function outputs the names, types and first few entries of a data frame. This allows you to take a quick look at the data. As you can see, the ‘Species’ variable in the data frame has a new data type called “Factor”. This is essentially a character variable with a fixed, known set of distinct values (its levels), which allows R to run count-type statistics on it. Before R 4.0.0, character variables were converted to factors by default when loaded into data frames; from R 4.0.0 onwards they stay character unless you request the conversion (for example with stringsAsFactors = TRUE). We will talk more about factors later.
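
As a brief preview, a factor can be built from any character vector. The sketch below uses a made-up vector (sizes) to show the levels, the per-level counts, and the integer codes stored underneath.

```r
sizes <- factor(c("small", "large", "small", "medium", "small"))

levels(sizes)       # "large" "medium" "small" (sorted alphabetically by default)
nlevels(sizes)      # 3
table(sizes)        # counts per level
as.integer(sizes)   # underlying codes: 3 1 3 2 3
```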

Basic Descriptive Statistics

The summary() function outputs basic statistics for each column of a data frame. For numeric variables, these are the five-number summary (minimum, first quartile, median, third quartile, maximum) plus the mean. For factor variables, it reports the count of each level. If a categorical variable is stored as character instead of factor, these counts are not available; only the length, class and mode are reported.

iris$Species = as.character(iris$Species)
summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##    Species         
##  Length:150        
##  Class :character  
##  Mode  :character  
##                    
##                    
##

I did two things in the previous chunk. First, I accessed the Species variable and cast it to type character. Then, I took the summary of the data frame. The $ operator is the simplest way to index a variable in R. However, it is not foolproof: for example, it accepts partial matches of column names. The recommended way to index a variable is as follows, with double brackets and quoted variable names:

data(iris) # Reload original iris data
iris[["Species"]] = as.character(iris[["Species"]])
summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##    Species         
##  Length:150        
##  Class :character  
##  Mode  :character  
##                    
##                    
##
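
For comparison, here is what summary() reports for the Species column while it is still a factor. The sketch reads the untouched copy of the data through datasets::iris, so it does not depend on the conversions above; species_factor is an illustrative name.

```r
species_factor <- datasets::iris$Species

# Factor: one count per level
summary(species_factor)

# Character: only length, class and mode
summary(as.character(species_factor))
```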

You can perform mathematical computations on numeric variables and create new ones using the same syntax.

iris[["New_Numeric_Var_1"]] = iris[["Sepal.Length"]] + iris[["Sepal.Width"]]
iris$New_Numeric_Var_2 = iris$Sepal.Length/(iris$Sepal.Width+0.00001)

In the first line, I indexed the variables using the preferred method and made a new variable. In the second line, I used the simpler $ method. I also added a scalar to all values of the Sepal.Width variable. R is a very flexible language: you can do most computations without explicitly making the operands the same length, because shorter vectors are recycled. However, this may cause issues. Make sure to regularly check your data for any inconsistencies that this flexibility might cause.
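
A brief sketch of the kind of silent recycling to watch for. The vectors are made up for illustration; suppressWarnings() is used only so the second example runs quietly.

```r
long <- c(1, 2, 3, 4)
short <- c(10, 20)

# The shorter vector is reused without any message
long + short    # 11 22 13 24

# When the lengths are not multiples of each other, R still computes
# a result but issues a warning (suppressed here)
result <- suppressWarnings(c(1, 2, 3) + c(10, 20))
result          # 11 22 13
```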

Save / Load Environment Objects in R

You can serialize and save any object, or even the whole workspace, in R. The save.image() function saves the whole workspace; save() writes only the objects you name.

After you have created some objects in the R environment, you can use save.image() to write all of them to a file. The load() function restores the saved objects.

save.image(file='myEnvironment.RData')
load('myEnvironment.RData')
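
To save individual objects rather than the whole workspace, save() and saveRDS() can be used. The sketch below writes to temporary files so it leaves nothing behind in your working directory; the object and path names are illustrative.

```r
scores <- c(4, 7, 9)

# save() stores named objects; load() restores them under the same names
rdata_path <- tempfile(fileext = ".RData")
save(scores, file = rdata_path)
rm(scores)
load(rdata_path)
scores                        # 4 7 9

# saveRDS() stores a single object; readRDS() returns it so you choose the name
rds_path <- tempfile(fileext = ".rds")
saveRDS(scores, rds_path)
restored <- readRDS(rds_path)
identical(restored, scores)   # TRUE
```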

Installing Packages

For future tasks, you will need to install different packages. To perform this installation, either select “Install Packages” from “Tools” menu in RStudio, or use the install.packages command in console. Let’s install stringr package that you will be using in the class using this command.

# install.packages("stringr") 
# I have this package already installed so I'm skipping this. You should run this command.

R also has the control structures found in most programming languages, so let’s install this package inside an ‘if’ structure.

# installed.packages() returns a matrix with one row per package, named by package
if (!"stringr" %in% rownames(installed.packages())) {
  install.packages("stringr")
}

This chunk checks whether stringr is installed and installs it if it is not. The installed.packages() command lists all the installed packages in your libraries. The %in% operator checks whether the left-hand side is contained in the right-hand side, and ‘!’ is negation in R. As before, the install.packages() command installs the package. R also has for loops, while loops, tryCatch() error handling, and so on. We can talk about these when the need arises. If you want to see more examples of control structures, go to one of the several tutorials available online.
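
For a first impression, here is a minimal sketch of a for loop, a while loop and tryCatch(). The variable names and the error message are made up for illustration.

```r
# for loop: add up the numbers 1 to 5
total <- 0
for (i in 1:5) {
  total <- total + i
}
total   # 15

# while loop: double n until it reaches at least 100
n <- 1
while (n < 100) {
  n <- n * 2
}
n       # 128

# tryCatch: run code and handle an error instead of stopping
outcome <- tryCatch(
  stop("something went wrong"),
  error = function(e) paste("Caught:", conditionMessage(e))
)
outcome   # "Caught: something went wrong"
```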

Basic Data Wrangling with dplyr

The dplyr package makes data manipulation fast and easy:

By constraining your options, it helps you think about your data manipulation challenges.
It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.
It uses efficient backends, so you spend less time waiting for the computer.

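%>% is called the pipe operator. It passes the result on its left as the first argument of the function on its right, so by chaining dplyr functions with it you can build up an analysis step by step.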

Below, we present some of the basic things you can do with it. For more, please refer to the official dplyr documentation.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

head(iris %>% select(Sepal.Length, Sepal.Width)) # Selects only 2 columns in the data frame.

##   Sepal.Length Sepal.Width
## 1          5.1         3.5
## 2          4.9         3.0
## 3          4.7         3.2
## 4          4.6         3.1
## 5          5.0         3.6
## 6          5.4         3.9

iris %>% filter(Sepal.Length < 5, Sepal.Width > 3.5) # You can filter the data frame.

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          4.6         3.6          1.0         0.2  setosa
## 2          4.9         3.6          1.4         0.1  setosa
##   New_Numeric_Var_1 New_Numeric_Var_2
## 1               8.2          1.277774
## 2               8.5          1.361107

iris %>% rowwise() %>% summarise(Sepal.Area = Sepal.Length*Sepal.Width) # You can create new features

## # A tibble: 150 x 1
##    Sepal.Area
##         <dbl>
##  1       17.8
##  2       14.7
##  3       15.0
##  4       14.3
##  5       18  
##  6       21.1
##  7       15.6
##  8       17  
##  9       12.8
## 10       15.2
## # … with 140 more rows
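
Two more verbs are worth a quick look: mutate() adds a column while keeping the existing ones, and group_by() followed by summarise() computes one value per group. This sketch assumes dplyr is installed and uses the untouched datasets::iris; the names iris_with_area, species_means and mean_sepal_length are illustrative.

```r
library(dplyr)

# mutate() keeps all existing columns and appends the new one
iris_with_area <- datasets::iris %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width)
head(iris_with_area)

# group_by() + summarise() gives one row per species
species_means <- datasets::iris %>%
  group_by(Species) %>%
  summarise(mean_sepal_length = mean(Sepal.Length))
species_means
```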