When working with data you must:
- Figure out what you want to do.
- Describe those tasks in the form of a computer program.
- Execute the program.
The dplyr package makes these steps fast and easy:
- By constraining your options, it helps you think about your data manipulation challenges.
- It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.
- It uses efficient backends, so you spend less time waiting for the computer.
Dplyr aims to provide a function for each basic verb of data manipulation:
filter()to select cases based on their values.
arrange()to reorder the cases.
select()select variables based on their names.
transmute()to add new variables that are functions of existing variables.
summarise()to condense multiple values to a single value.
sample_frac()to take random samples.
Before we dive into the functions, let’s load up the packages:
install.packages("dplyr") library(dplyr) install.packages("nycflights13") library(nycflights13)
dim(flights)# R Default Dataframe flight
 336776 19
Filter rows with
filter(flights, month == 1, day == 1)
Arrange rows with
arrange() works similarly to
filter() except that instead of filtering or selecting rows, it reorders them. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
arrange(flights, year, month, day)
desc() to order a column in descending order:
Select columns with
Often you work with large datasets with many columns but only a few are actually of interest to you.
select()allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:
# Select all columns between year and day (inclusive) select(flights, year:day)
# Select all columns except those from year to day (inclusive) select(flights, -(year:day))
Add new columns with
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of
dplyr::mutate() is similar to the base
transform(), but allows you to refer to columns that you’ve just created:
mutate(flights, gain = arr_delay - dep_delay, speed = distance / air_time * 60 )
If you only want to keep the new variables, use
transmute(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60) )
Summarise values with
The last verb is
summarise(). It collapses a data frame to a single row.
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
Randomly sample rows with
You can use
sample_frac() to take a random sample of rows: use
sample_n() for a fixed number and
sample_frac() for a fixed fraction.
summarise() with aggregate functions, which take a vector of values and return a single number. There are many useful examples of such functions in base R like
IQR(). dplyr provides a handful of others:
n(): the number of observations in the current group
n_distinct(x):the number of unique values in
nth(x, n)– these work similarly to
x[n]but give you more control over the result if the value is missing.
summarise(group_by(flights, dest), planes = n_distinct(tailnum), flights = n() )
The count function tallies observations based on a group. It is slightly similar to the table function in the base package. For example:
The pipe operator in R, represented by
%>% can be used to chain code together. It is very useful when you are performing several operations on data, and don’t want to save the output at each intermediate step.
filter_fl = filter(flights,carrier ==”FL”)
With piping, the above code can be rewritten as:
flights %>% filter(carrier == “FL”) %>% arrange(desc(month))