Data Manipulation with dplyr

When working with data you must:

  • Figure out what you want to do.
  • Describe those tasks in the form of a computer program.
  • Execute the program.

The dplyr package makes these steps fast and easy:

  • By constraining your options, it helps you think about your data manipulation challenges.
  • It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.
  • It uses efficient backends, so you spend less time waiting for the computer.

Dplyr aims to provide a function for each basic verb of data manipulation:

  • filter() to select cases based on their values.
  • arrange() to reorder the cases.
  • select()  select variables based on their names.
  • mutate() and transmute() to add new variables that are functions of existing variables.
  • summarise() to condense multiple values to a single value.
  • sample_n() and sample_frac() to take random samples.
  • count()
  • pipe 

Before we dive into the functions, let’s load up the packages:


dim(flights) # R Default Dataframe flight

[1] 336776 19

Filter rows with filter()

filter(flights, month == 1, day == 1)


Arrange rows with arrange()

arrange() works similarly to filter() except that instead of filtering or selecting rows, it reorders them. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

arrange(flights, year, month, day)


Use desc() to order a column in descending order:

arrange(flights, desc(arr_delay))


Select columns with select()

Often you work with large datasets with many columns but only a few are actually of interest to you. select()allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:

# Select all columns between year and day (inclusive)
select(flights, year:day)


# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))


Add new columns with mutate()

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate():

dplyr::mutate() is similar to the base transform(), but allows you to refer to columns that you’ve just created:

  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60


If you only want to keep the new variables, use transmute():

  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)


Summarise values with summarise()

The last verb is summarise(). It collapses a data frame to a single row.

  delay = mean(dep_delay, na.rm = TRUE))


Randomly sample rows with sample_n() and sample_frac()

You can use sample_n() and sample_frac() to take a random sample of rows: use sample_n() for a fixed number and sample_frac() for a fixed fraction.

sample_n(flights, 5)


sample_frac(flights, 0.01)


You use summarise() with aggregate functions, which take a vector of values and return a single number. There are many useful examples of such functions in base R like min()max()mean()sum()sd()median(), and IQR(). dplyr provides a handful of others:

  • n(): the number of observations in the current group
  • n_distinct(x):the number of unique values in x.
  • first(x)last(x) and nth(x, n) – these work similarly to x[1]x[length(x)], and x[n] but give you more control over the result if the value is missing.summarise(group_by(flights, dest), planes = n_distinct(tailnum), flights = n() )


The count function tallies observations based on a group. It is slightly similar to the table function in the base package. For example:





The pipe operator in R, represented by %>% can be used to chain code together. It is very useful when you are performing several operations on data, and don’t want to save the output at each intermediate step.

filter_fl = filter(flights,carrier ==”FL”)

arrange(filter_fl, desc(month))

With piping, the above code can be rewritten as:

flights %>% filter(carrier == “FL”) %>% arrange(desc(month))



Introduction to K-means Clustering

Introduction to K-means Clustering

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups).

k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:

Continue reading “Introduction to K-means Clustering”

Install Python with Jupyter

The easiest way to install the Jupyter Notebook App is installing a scientific python distribution which also includes scientific python packages. The most common distribution is called Anaconda:

  • Download Anaconda Distribution (a few 100MB), Python 3, 64 bits.
  • Install it using the default settings for a single user.

Continue reading “Install Python with Jupyter”

Chapter 2 : Working with SAS Programs.


In this lesson, you’ll learn how to work with SAS code. First, you’ll learn the main components of SAS programs. Then you’ll learn the syntax rules and formatting guidelines for writing SAS programs. As you work with SAS programs, you’ll add descriptive comments, and identify and correct common syntax errors. Continue reading “Chapter 2 : Working with SAS Programs.”

Chapter 1. Getting Started with SAS Programming

What You Learn in This Course

Welcome to the SAS Programming 1: Essentials e-course.

In this course, you’ll write SAS programs to access, manage, and analyze your data, and present the results in reports.


Continue reading “Chapter 1. Getting Started with SAS Programming”

%d bloggers like this: