Programming in R

Part 2: Data Types and Data Structures

Author
Affiliation

Ethan Marzban

DS Collab

Prerequisites

  • Please make sure you have downloaded both R and RStudio.

  • Please make sure you are comfortable with the material in Part 1 (Fundamentals).

Data Types

Much in the way different variables (in a statistical sense) can have different types (numerical vs. categorical), so to can quantities in R. For example, R treats the quantity 2 differently from the quantity "hello". We use the term data type to, loosely speaking, refer to the actual type of a particular quantity (e.g. numerical, character, etc.) The main data types in R are:

  • double: refers to numerical (real-valued) quantities
  • integer: refers to integers
  • character: refers to character- or text-type data (and will always be enclosed in either single quotation marks or double quotation marks)
Important

In R, numbers are by default encoded with type double.

To check the type of a particular quantity, we can use the typeof() function. For example:

typeof(1.1)
[1] "double"
typeof(1)
[1] "double"

Again, note that the type of 1 is actually double! If we really wanted to reference the integer 1, we can use an L:

typeof(1L)
[1] "integer"

Data Structures

We can combine quantities in R as well, to produce what are known as data structures. Some of more common data structures in R are:

  • Vectors
  • Data Frames
  • Matrices
  • Arrays
  • Lists

Each data structure has a situation in which it is ideal; for now, let’s discuss vectors and data frames.

Vectors

Vectors are the most fundamental data structure in R. A vector is, effectively, a one-dimensional list of values. In R, we use the following syntax to create a vector:

c(<element 1>, <element 2>, ...)

For example,

c(1L, 3.5, "hello")
[1] "1"     "3.5"   "hello"

(By the Way: note how, in the output, R has converted all elements to be of type character! We’ll revisit this later.)

Data Frames

Another very popular way of storing data in R is using what is known as a data frame. A data frame can be thought of as a collection of vectors, arranged in tabular format; indeed, when storing data in a data frame, we are necessarily implementing the data matrix representation of data (see Workshop 02 for a refresher on this).

Data frames can be created using the data.frame() function in R:

data.frame(
  colname1 = c(val1, val2, ...),
  colname2 = c(val1, val2, ...),
  ...
)

For example:

data.frame(
  col1 = c(2, 4, 6),
  col2 = c(1, 3, 5)
)
  col1 col2
1    2    1
2    4    3
3    6    5
Note

When displaying data frames, R always puts an initial column corresponding to row indices.

Note that data frames are created column-wise. Additionally, the columns in a data frame must all be of the same length;

data.frame(
  col1 = c(2, 4, 6),
  col2 = c(1, 3, 5, 7)
)
Error in data.frame(col1 = c(2, 4, 6), col2 = c(1, 3, 5, 7)): arguments imply differing number of rows: 3, 4
Tip

The columns in a data frame can be comprised of different data types.

Exercise 1

Make a data frame in R based on the following table:

name species age
Rover Dog 4
Samson Parrot 1
Olivia Cat 3
Hoppers Rabbit 2

Assign your data frame to a variable called animals. That is, after completing this exercise, you should be able to run animals and obtain

animals
     name species age
1   Rover     Dog   4
2  Samson  Parrot   1
3  Olivia     Cat   3
4 Hoppers  Rabbit   2

Slicing/Indexing

We may wish to extract or access only certain portions of a vector and/or data frame. This can be accomplished using slicing (aka indexing).

Given a vector x, the ith element of x can be extracted using the syntax x[i]. For example:

x <- c(1, 3, 5, 7)
x[2]
[1] 3

Given a data frame d, the (i, j)th element of d can be extracted using the syntax d[i, j]. For example:

d <- data.frame(
  col1 = c(1, 3, 5, 7),
  col2 = c(2, 4, 6, 8)
)

d[4, 2]
[1] 8

We can also access columns in a data frame by using the $ operator. That is: given a data frame df with a column named col1, the synax df$col1 returns the col1 column of df. For example, given the d data frame defined above, we can access col2 by running

d$col2
[1] 2 4 6 8
Tip

When returning columns of a data frame using slicing, R always returns a vector.

Exercise 2

Returning to the animals data frame created above: return Olivia’s age in two different ways:

  1. by indexing
  2. by first extracting the age column, and then extracting the appropriate element.

Importing Data

In many situations, we will want to import data that has been written or stored in a file or website. The basic function for reading data into R is read.table().

Exercise 3

Look up the help file for the read.table() function. Additionally, look up the help file for the read.csv() function, and note how it differs from the read.table() function.

Exercise 4

Import the data located at https://pstat5a.github.io/Files/Datasets/movies_2000s.csv, and assign it to a variable called movies_2000. (Though you can technically download the data, try to import the data without downloading it onto your local machine). We will explore this data during the in-person portion of the Workshop in General Meeting 3.