Data Manipulation with dplyr

When working with data you must:

  • Figure out what you want to do.
  • Describe those tasks in the form of a computer program.
  • Execute the program.

The dplyr package makes these steps fast and easy:

  • By constraining your options, it helps you think about your data manipulation challenges.
  • It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.
  • It uses efficient backends, so you spend less time waiting for the computer.

Dplyr aims to provide a function for each basic verb of data manipulation:

  • filter() to select cases based on their values.
  • arrange() to reorder the cases.
  • select()  select variables based on their names.
  • mutate() and transmute() to add new variables that are functions of existing variables.
  • summarise() to condense multiple values to a single value.
  • sample_n() and sample_frac() to take random samples.
  • count()
  • pipe 

Before we dive into the functions, let’s load up the packages:

install.packages("dplyr")
library(dplyr)
install.packages("nycflights13")
library(nycflights13)

dim(flights) # R Default Dataframe flight

dim(flights)
[1] 336776 19

Filter rows with filter()

filter(flights, month == 1, day == 1)

q

Arrange rows with arrange()

arrange() works similarly to filter() except that instead of filtering or selecting rows, it reorders them. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

arrange(flights, year, month, day)

q

Use desc() to order a column in descending order:

arrange(flights, desc(arr_delay))

q

Select columns with select()

Often you work with large datasets with many columns but only a few are actually of interest to you. select()allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:

# Select all columns between year and day (inclusive)
select(flights, year:day)

q

# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))

q.PNG

Add new columns with mutate()

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate():

dplyr::mutate() is similar to the base transform(), but allows you to refer to columns that you’ve just created:

mutate(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)

q

If you only want to keep the new variables, use transmute():

transmute(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

q

Summarise values with summarise()

The last verb is summarise(). It collapses a data frame to a single row.

summarise(flights,
  delay = mean(dep_delay, na.rm = TRUE))

q

Randomly sample rows with sample_n() and sample_frac()

You can use sample_n() and sample_frac() to take a random sample of rows: use sample_n() for a fixed number and sample_frac() for a fixed fraction.

sample_n(flights, 5)

q

sample_frac(flights, 0.01)

q

You use summarise() with aggregate functions, which take a vector of values and return a single number. There are many useful examples of such functions in base R like min()max()mean()sum()sd()median(), and IQR(). dplyr provides a handful of others:

  • n(): the number of observations in the current group
  • n_distinct(x):the number of unique values in x.
  • first(x)last(x) and nth(x, n) – these work similarly to x[1]x[length(x)], and x[n] but give you more control over the result if the value is missing.summarise(group_by(flights, dest), planes = n_distinct(tailnum), flights = n() )

    q.PNG

Count
The count function tallies observations based on a group. It is slightly similar to the table function in the base package. For example:

count(flights,year)

q

count(flights,carrier)

q

Pipe
The pipe operator in R, represented by %>% can be used to chain code together. It is very useful when you are performing several operations on data, and don’t want to save the output at each intermediate step.

filter_fl = filter(flights,carrier ==”FL”)

arrange(filter_fl, desc(month))

With piping, the above code can be rewritten as:

flights %>% filter(carrier == “FL”) %>% arrange(desc(month))

q

 

%d bloggers like this: