Friday, July 31, 2020

Manipulating and joining data

Selecting subjects based on criteria


Earlier, you learnt how to access rows (subjects) and columns (variables) in a dataframe. For example, to get the first row of the mpg dataset, you would use:

mpg[1,]
## # A tibble: 1 x 11
##   manufacturer model displ  year   cyl    trans   drv   cty   hwy    fl
##          <chr> <chr> <dbl> <int> <int>    <chr> <chr> <int> <int> <chr>
## 1         audi    a4   1.8  1999     4 auto(l5)     f    18    29     p
## # ... with 1 more variables: class <chr>

This way is fine most of the time, but occasionally I would like to be more precise with my filtering. So, I'm going to introduce you to the package dplyr in the tidyverse package, which is oh-so-much nicer! Also, as a treat, we will move away from the mpg dataset (please have a moment to reminisce – we’ll miss you mpg!) and introduce the flights dataset. This dataset is in the nycflights13 package.

If you've forgotten how to get packages, here's a quick recap: In R, go to the top menu bar and click Tools > Install Packages… and then type the name of the package you want (Hint: It's the nycflights13 package). The nycflights13 package contains information on all 336776 flights out of New York City in 2013.

library(nycflights13)
flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

The first verb that dplyr introduces is filter(). This lets us filter the subjects that we want, that is, filter rows. So first, pause for a second and decide what the subjects in the flights dataset are.

Let's start by filtering for just the flights in January (Month 1):

filter(flights, month == 1)
## # A tibble: 27,004 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 26,994 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

This gives us 27004 rows (refer to the top of the output).

Breaking this command down:

  • filter(): function to do all of the work
  • flights: first argument in the brackets gives the dataframe name, and
  • month == 1: this is the criteria to filter on.

In this case, the variable month is equal to == the first month 1. Notice the use of a double = to indicate "equal to"? A single = will cause an error. If you don't believe me, try it!

How about the first day of January?

filter(flights, month == 1, day == 1)
## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

We can just keep adding criteria.

Aside - magrittr

As we build up more verbs in our data manipulation tool bag, we are going to end up with lots of nested functions. Instead, we are going to use the magrittr or "pipe" symbol %>%.

This command can be read as "then", and is used to join verbs. For example, to get the first of January, we can rewrite the above command as:

flights %>% filter(month == 1, day == 1)
## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

...which we read as:

  • get the dataframe: flights,
  • then: %>%, and
  • filter for Jan 1: filter(month == 1, day == 1).

OK, your go. Filter for all American Airlines flights (AA)!

My turn:

flights %>% filter(carrier == "AA")
## # A tibble: 32,729 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      542            540         2      923
##  2  2013     1     1      558            600        -2      753
##  3  2013     1     1      559            600        -1      941
##  4  2013     1     1      606            610        -4      858
##  5  2013     1     1      623            610        13      920
##  6  2013     1     1      628            630        -2     1137
##  7  2013     1     1      629            630        -1      824
##  8  2013     1     1      635            635         0     1028
##  9  2013     1     1      656            700        -4      854
## 10  2013     1     1      656            659        -3      949
## # ... with 32,719 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

That's how to filter(). Next up are some quiz questions, followed by how to use select().

Select

The thing about the ncyflights13 dataset (or many “Big” datasets) is that it is too wide to fit onto the screen. In all the examples of filtering above we couldn’t even see all of the variables – there were always 12 columns which didn’t fit on the screen. It’d be nice if we could make our dataframe a bit narrower, so that we can fit the information we’re interested in (and nothing else) onto the screen. This is what select does – it’s essentially a filter, but for columns rather than rows.

So let’s say I was just interested in departure times and arrival times for all American Airways flights on the 1st of January, then I could just add to our filter example from before:

flights %>% filter(carrier == "AA",month == 1, day == 1) %>% select(flight,dep_time,arr_time)
## # A tibble: 94 x 3
##    flight dep_time arr_time
##     <int>    <int>    <int>
##  1   1141      542      923
##  2    301      558      753
##  3    707      559      941
##  4   1895      606      858
##  5   1837      623      920
##  6    413      628     1137
##  7    303      629      824
##  8    711      635     1028
##  9    305      656      854
## 10   1815      656      949
## # ... with 84 more rows

Now we only have 94 × 3 = 282 pieces of information in our dataframe – this is starting to look more manageable!

We can get fancy with select: if I wanted to grab just the variables from flights that have something to do with “time”, then I could use the contains command:

select(flights, contains("time"))

or

flights %>% select(contains("time"))

In a similar vein, you might be able to guess what starts_with() or ends_with() do. I can also select, say, the columns from year to day:

select(flights, year:day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

Type ?select at the console to see some more examples.

So that’s how to access both rows and columns of our dataframe using filter and select respectively, but how about if we want to make some new variables? It’s time for us – like a statistical X-Men character – to mutate.

Making new variables


If you have made it this far, you should now be able to filter() rows and select() columns, but what if you wanted to add new columns? In the dataset flights, we have the departure delay dep_delay, which is the difference between the scheduled departure time (sched_dep_time) and the departure time (dep_time).

Let's assume that this is not given and that we would like to calculate it. We can do that with:

flights %>% mutate(delay = dep_time - sched_dep_time)
## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, delay <int>

To work through this, you should:

  • Take the dataset: flights
  • Then: %>%
  • Add a new column: mutate()
  • Define the delay: delay =, which is calculated as the difference: dep_time - sched_dep_time.

If you look at the last row of the output, you see the new column delay int. But wait! That is a trick! Look at the flights dataframe again:

flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

The new column is not there. Why? Because you have to tell R to save the new column like this:

flights <- flights %>% mutate(delay = dep_time - sched_dep_time)

This is the same as before, but the extra flights <- basically says "do the stuff, then save it as the dataframe flights", which basically means "overwrite it".

Have a play, then we will start grouping with group_by().


Grouping subjects based on criteria



Now that you can filter, select, and mutate, you're in a position to do something quite powerful - clump variables together into groups, and then summarise these groups. Grouping won't look like much just yet, but stick with me on this.

Based on just a little bit of experience, I have a hypothesis about flight delays in New York City: I reckon they're worse in winter. Snowstorms, ice, wind... I suspect that all of these will make delays in the winter months of December, January, February, worse than in the summer months. To investigate this, you'll need to group flights by month, which you can do like this:

by_month <- group_by(flights,month)
by_month
## # A tibble: 336,776 x 20
## # Groups:   month [12]
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, delay <int>

Well, that was underwhelming! This data frame looks pretty much the same as the original, apart from the second line: Groups: month [12]. That tells me that a group has been created for each month, but to explore this further, you'll have to learn to summarise - that's in the next section.

Before you do, two quick notes about group_by:

  • You can group by multiple variables: by_day <- group_by(flights,year,month,day) will create a dataframe with 365 groups for each day of the year (try it!), and
  • You can ungroup a grouped dataframe using, you guessed it, ungroup(). That'll be handy in the next section.

Producing summary statistics for groups of subjects



To test my hypothesis about flight delays, I will need to create a summary statistic about delays for each month. You can do the same by calculating the mean flight departure delay for each calendar month like this:

summarise(by_month, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 12 x 2
##    month     delay
##    <int>     <dbl>
##  1     1 10.036665
##  2     2 10.816843
##  3     3 13.227076
##  4     4 13.938038
##  5     5 12.986859
##  6     6 20.846332
##  7     7 21.727787
##  8     8 12.611040
##  9     9  6.722476
## 10    10  6.243988
## 11    11  5.435362
## 12    12 16.576688

Firstly, let's break it down: This has taken my grouped dataframe by_month, and for each group has computed the mean of the values in the dep_delay column for that group. The na.rm = TRUE argument has told the mean function to remove (rm in unix-speak) all values that are not available NA. Basically, some rows in this dataframe do not have an entry in the dep_delay column, so R puts the symbol NA there instead. Trying to calculate the mean of this symbol doesn't work (try it without the na.rm = TRUE bit!), so we get rid of them instead.

Secondly, I don't think my hypothesis was correct. Maybe December had slightly worse delays than the preceeding months, but January and February really weren't so bad, and by far the worst months are June and July. We can make a nice plot of the trends over the entire year using ggplot. We'll use our skillz (not skills!) to group by day this time instead of month, because... it looks cooler!

by_day <- group_by(flights,year,month,day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE)) %>%
ungroup() %>% 
mutate(day_num = seq_along(delay)) %>% 
ggplot(aes(day_num,delay)) + 
geom_point() + 
geom_smooth()

flight delay scatterplot

I added a slightly tricky intermediate step here to create a column day_num counting the days of the year: ungroup() %>% mutate(day_num = seq_along(delay)) ungroups by_day, and then creates a sequence along the column delay - essentially counting the row numbers.

Anyway, the middle of the year really does look like a worse time to fly - so much for my hypothesis! Guided by this exploration, I think I have a new one now though: June, July, and December would be the busiest months for flying, so maybe it's simply that there are longer delays when there are more flights. A simple modification to our summarise command will allow us to explore this relationship:

summarise(by_day, delay = mean(dep_delay, na.rm = TRUE), num_flights = n()) %>%
ggplot(aes(num_flights,delay)) + 
geom_point() + 
geom_smooth()

flight delays by number of flights scatterplot

The num_flights = n() bit produces a second summary statistic for each group, which is just the number of items in that group. Note that I don't need to index days of the year now, so I can lose the ungroup bit.

It looks like there's some relationship between number of flights and delays, but it's not particularly strong. Again, some more investigation is needed.


Putting together multiple dataframes with common subjects



Earlier, I asked you to get all rows in the flights dataset that were American Airlines. I told you the abbreviation was “AA” and you were happy to accept that. First: Thank you for trusting me! But you are a statistician now, and so you should be embracing your inner cynic. Not so much… "Show me the money!" as "Show me the data!"

So where did I get this information from? Look at the dataframe airlines in the nycflights13 package:

airlines
## # A tibble: 16 x 2
##    carrier                        name
##      <chr>                       <chr>
##  1      9E           Endeavor Air Inc.
##  2      AA      American Airlines Inc.
##  3      AS        Alaska Airlines Inc.
##  4      B6             JetBlue Airways
##  5      DL        Delta Air Lines Inc.
##  6      EV    ExpressJet Airlines Inc.
##  7      F9      Frontier Airlines Inc.
##  8      FL AirTran Airways Corporation
##  9      HA      Hawaiian Airlines Inc.
## 10      MQ                   Envoy Air
## 11      OO       SkyWest Airlines Inc.
## 12      UA       United Air Lines Inc.
## 13      US             US Airways Inc.
## 14      VX              Virgin America
## 15      WN      Southwest Airlines Co.
## 16      YV          Mesa Airlines Inc.

So there are two dataframes:

  • flights - each row is a flight, with a variable carrier, which gives the abbreviation for the carrier of each flight
  • airlines - each row is an airline company, which also has a variable called carrier.

I would like you to add the full names to the flights dataset and then save it. Have a go!

flights <- left_join(flights, airlines, by = "carrier")

So, does it work? Look at the new dataframe, but select() only a few columns to make it easier to check:

flights %>% select(year, month, day, carrier, name)
## # A tibble: 336,776 x 5
##     year month   day carrier                     name
##    <int> <int> <int>   <chr>                    <chr>
##  1  2013     1     1      UA    United Air Lines Inc.
##  2  2013     1     1      UA    United Air Lines Inc.
##  3  2013     1     1      AA   American Airlines Inc.
##  4  2013     1     1      B6          JetBlue Airways
##  5  2013     1     1      DL     Delta Air Lines Inc.
##  6  2013     1     1      UA    United Air Lines Inc.
##  7  2013     1     1      B6          JetBlue Airways
##  8  2013     1     1      EV ExpressJet Airlines Inc.
##  9  2013     1     1      B6          JetBlue Airways
## 10  2013     1     1      AA   American Airlines Inc.
## # ... with 336,766 more rows

Looks good, but how did it work? Well, left_join takes every row from the left dataset in its brackets, in our case flights, then adds all the information in the right dataframe airline to each row by matching on the carrier by = "carrier". 

Want a cheat sheet to help? Then this is built into RStudio. Go to:

Help > Cheatsheets > Data Manipulation with dplyr, tidyr

RStudio Cheatsheet menu

Want a PDF?

Data Wrangling cheatsheet PDF preview


Getting the dataset into R

Introduction to the dataset, installed the gapminder package:

install.packages("gapminder") 

...and installed the package:

library(gapminder) 

...then there's really nothing more left to do. Type gapminder and you'll see the dataframe, with columns for: country, continent, year, average expectancy, population, and GDP per capita.

Why don't you take a few minutes to try using some of the techniques you've learnt in this section to filter, select, mutate, group_by, summarise, and just generally explore the dataset, and then come back for the activity!