Wednesday, July 29, 2020

Visualising relationships

How to produce a scatter plot


In this section, we'll look at one of the most basic tools for visualising relationships between continuous variables - the scatter plot. We'll be using ggplot again here.

Scatter plots are great. They allow you to visually represent your data, and if relationships exist between the variables plotted, you'll be able to see them straight away.

Let's dive straight in with an example, using the good old mpg dataset.

We'd like to know if there's a relationship between a car's engine size (the displ variable) and its fuel efficiency in the city (the cty variable, measured in miles per gallon). It feels like there should be a relationship there, right? (What direction do you expect the relationship to be?) To find out, type:

ggplot(mpg) +
  geom_point(mapping = aes(x = displ, y = cty))

graph

Yes! As engine size increases, fuel efficiency decreases. If you didn't get a plot, you may need to type library("tidyverse") again.

This code worked in much the same way as our barchart and histogram from Section 1: we defined a canvas and the mpg dataset using ggplot(mpg), and then added a layer of points with geom_point, using an aesthetic mapping that puts displ on the x-axis and cty on the y-axis.

Visualizing Relationships using ggplot


Adding aesthetics


It's great to see the inverse relationship between engine size and fuel efficiency, but I feel like there are other dimensions here. For instance, a few cars stick out from the pack towards the lower right of this plot. What's going on with them?

scatterplot graph

These cars have large engines, but are slightly more fuel efficient than other cars with engines of a similar size. I have a hypothesis that these cars are rear-wheel drive. It feels like rear-wheel drive cars probably have larger engines than front-wheel drive cars, and maybe they're more efficient as well (you might be able to tell that I don't know a lot about cars). To investigate this hypothesis, we can colour the points in the plot by the type of "drive" of the car, stored in the variable drv:

ggplot(mpg) + geom_point(mapping = aes(x = displ, y = cty, colour = drv))

coloured drive scatterplot graph

(If you are American, you're welcome to spell "colour" the wrong way - without the "u" - as the plot will still work.) The points are coloured now by the type of drive: front f, rear r, or 4-wheel 4. We can now see clearly that four-wheel drive cars are generally less fuel efficient, and front-wheel drive cars have usually got smaller engines.

It looks like the answer about whether those outlying cars are rear-wheel drives or not is a resounding "maybe". Some of the cars in that group are rear-wheel drive, but not all. We would need to do some more investigation here, but we have learnt some new things about our dataset through this exploration.
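One way to start that investigation is to pull the outlying cars out of the dataset and count them by drive type. This is only a rough sketch: the cut-offs (engine size above 5 litres, city efficiency above 14 mpg) are my own guesses read off the plot, and the name outlying is made up for this example.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Assumed thresholds, chosen by eye from the scatter plot
outlying <- mpg %>%
  filter(displ > 5, cty > 14)

# How many of the outlying cars fall into each drive type and class?
outlying %>%
  count(drv, class)
```

Changing the thresholds will change which cars count as "outlying", so it is worth trying a few values before drawing any conclusions.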

Adding layers


As we've alluded to already, ggplot is built upon layers, making it an incredibly powerful tool for creating more complex visualisations. The layers let us build up the plot, adding extra visualisations to discover patterns in the data. For example, adding a trendline to our original scatter plot is simple:

 ggplot(mpg) + 
  geom_point(mapping = aes(x  = displ, y = cty)) +
  geom_smooth(mapping = aes(x  = displ, y = cty)) 

scatterplot with trendline

I have added the trendline as it makes it easier to visualise the relationship.

It's a bit painful to have to type that mapping out twice, so we can make our code a little bit more compact:

 ggplot(mpg, mapping = aes(x  = displ, y = cty)) + 
  geom_point() +
  geom_smooth() 

Adding separate trendlines for each type of drive in our coloured scatterplot is similar:

 ggplot(mpg, mapping = aes(x = displ, y = cty, colour = drv)) + 
  geom_point() + 
  geom_smooth() 

scatterplot with separate trendlines

Our figure is now getting a bit messy, so we can split each type of drive into its own plot using a facet_wrap:

 ggplot(mpg, mapping = aes(x = displ, y = cty)) + 
  geom_point() + 
  geom_smooth() +
  facet_wrap( ~ drv) 

plots split into each type of drive
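Mappings given in ggplot() are shared by every layer, while mappings given inside a geom apply to that layer only. As a small sketch of how the two can be mixed (assuming we want coloured points but a single trendline for all cars; the name p is just for this example):

```r
library(ggplot2)

# x and y are shared by both layers; colour is mapped for the points only,
# so geom_smooth() draws one trendline rather than one per drive type
p <- ggplot(mpg, mapping = aes(x = displ, y = cty)) +
  geom_point(mapping = aes(colour = drv)) +
  geom_smooth()

p
```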

Describing relationships between different types of variables


Which plot do I want?


You should now recognise that there are two main types of variables: quantitative and categorical. When illustrating the relationship between variables, the plot we use is dictated by the type of variable we are comparing.

figure with types of plots

From the figure we can see that the plots are:

  • Categorical (C) versus categorical (C): barcharts
  • Quantitative (Q) versus quantitative (Q): scatterplots, and
  • Quantitative (Q) versus categorical (C): boxplots.

For the final segment - categorical (C) versus quantitative (Q) - we can also use boxplots. While these suggestions are not the only possibilities, they are a good starting point.
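As a rough illustration of the three starting points, here is one plot of each kind using the mpg dataset. The choice of variables (class, drv, displ and cty) and the plot names are my own, picked only to show the pattern.

```r
library(ggplot2)

# Categorical vs categorical: bar chart of car class, bars split by drive type
p_cc <- ggplot(mpg) +
  geom_bar(mapping = aes(x = class, fill = drv))

# Quantitative vs quantitative: scatterplot of engine size against city efficiency
p_qq <- ggplot(mpg) +
  geom_point(mapping = aes(x = displ, y = cty))

# Quantitative vs categorical: boxplot of city efficiency for each drive type
p_qc <- ggplot(mpg) +
  geom_boxplot(mapping = aes(x = drv, y = cty))

p_cc
p_qq
p_qc
```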


Interpreting plots



Q vs Q: Strength, linearity, outliers, direction

To compare quantitative with quantitative, we use a scatterplot. But how do we describe this relationship? With direction, strength, and linearity.

Let's take each of these one at a time.

First, direction. This describes how the variable on the vertical y-axis changes as the variable on the horizontal x-axis increases. This scatterplot has a positive direction, as when we go from left to right, the points generally increase. The next scatterplot has a negative direction, because the points decrease when we go from left to right.

Next, we want to examine the strength, which is how close to a straight line the points lie. This is described as weak, moderate, or strong.

There's no strict definition of the boundaries between weak, moderate, and strong relationships, but roughly, we generally think of the correlations with magnitudes between 0.1 and 0.3 as being weak, between 0.3 and 0.6 as being moderate, and greater than 0.6 as being strong.
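To make those rough boundaries concrete, here is a small helper that turns a correlation into one of the labels. The function describe_strength is hypothetical (it is not part of R or any package), and the "very weak" label for magnitudes below 0.1 is my own assumption, since the guideline above doesn't name that range.

```r
# Hypothetical helper: label a correlation using the rough guideline above
describe_strength <- function(r) {
  size <- abs(r)  # strength ignores the sign, which only gives the direction
  if (size > 0.6) {
    "strong"
  } else if (size >= 0.3) {
    "moderate"
  } else if (size >= 0.1) {
    "weak"
  } else {
    "very weak"  # assumed label, not part of the guideline
  }
}

describe_strength(-0.8)
describe_strength(0.45)
describe_strength(0.2)
```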

You can compute the Pearson correlation in R using the cor function. For example, if we wanted to compute the correlation between engine size and city fuel efficiency in the mpg dataset, we'd type this:

cor(mpg$displ, mpg$cty)

to get about -0.8, a strong negative relationship. Always be careful using the correlation on its own: it should only be used alongside a scatterplot. For example, take a look at this scatterplot. It has a definite relationship, which is very strong, but it has a correlation of 0. Correlation is all about linear relationships. You need to look at the data to see non-linear relationships like this one.
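You can reproduce this kind of situation with some made-up data. In the sketch below, y is determined exactly by x, but along a symmetric curve, so the linear correlation comes out as essentially zero. The vectors x and y are invented for the example and are not the data behind the scatterplot above.

```r
library(ggplot2)

# Made-up data: y depends perfectly on x, but along a curve rather than a line
x <- -10:10
y <- x^2

# Pearson correlation only measures the linear part of a relationship
cor(x, y)  # essentially zero

# The scatterplot makes the U-shaped relationship obvious
ggplot(data.frame(x = x, y = y)) +
  geom_point(mapping = aes(x = x, y = y))
```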

The final question we ask is whether the relationship is best described by a straight line, what we call a linear relationship, or whether there is some curvature, a non-linear relationship. In this scatterplot, we see that the relationship is curved, so it's a non-linear relationship.

One more thing to consider is whether there is any point that sticks out, what we call an outlier. In the next section, we'll show how these can be identified with a box plot. But in scatterplots, it's a bit of a judgement call. In this scatterplot, there's a point at (1.5, 10) that sticks out. We'll look at how to deal with these points in the next section.

Finally, have a go yourself, back with the mpg dataset. This scatterplot has the city fuel efficiency against the displacement. Think about whether it passes the stupidity test. That is, does it make sense in context? Cars with bigger displacement are probably more powerful, but are also likely to be less fuel efficient, so this relationship does make sense.

Next, we'll look at the final relationship, quantitative versus categorical.



Bar chart showing the proportion of cars of each colour, by manufacturer:

A luxury car dealership sells cars from four manufacturers (Ferrari, Jaguar, Lamborghini, and Mercedes) in four colours (green, red, silver, and yellow). Below is a bar chart showing the proportion of cars of each colour, by manufacturer.

Bar chart showing the average winning margin by venue:

The Australian Football League (AFL) plays games at 9 “home” venues. The results for all VFL/AFL games played at these home venues from 8 May 1897 to 23 July 2017 (inclusive) were analysed, and the average winning margin at each venue calculated. Below is a bar chart showing the average winning margin by venue.


Scatterplot of Petal.Width vs Petal.Length:

The lengths and widths of petals and sepals were recorded for 150 Irises from three species. Below is a scatterplot of Petal.Width vs Petal.Length.

library(tidyverse)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
ggplot(iris) +
  geom_point(mapping = aes(x = Petal.Length, y = Petal.Width, colour = Species))

Getting the dataframe right


For the final part of this section, you're going to have a turn at using what you've learnt so far to analyse some very messy datasets - books - and visualise the relationships between them.

But first, we need to prepare our data.

We've seen in Section 1 how to import the complete novels of Jane Austen into R, and tidy them so that each word is a subject:

library(tidyverse)
library(tidytext)
library(janeaustenr)
library(stringr)
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup()
tidy_books <- original_books %>%
  unnest_tokens(word, text)

Now, if we want to count the number of instances of each word, we can use the count function in the dplyr package, which we'll be using extensively in the next section.

library(dplyr)
tidy_books %>%
  anti_join(stop_words) %>% 
  count(word, sort = TRUE)
## # A tibble: 13,914 x 2
##      word     n
##     <chr> <int>
##  1   miss  1855
##  2   time  1337
##  3  fanny   862
##  4   dear   822
##  5   lady   817
##  6    sir   806
##  7    day   797
##  8   emma   787
##  9 sister   727
## 10  house   699
## # ... with 13,904 more rows

We've removed common "stop words" here before counting, and used the "pipe" operator %>% to pass the output of each step directly to the next. We can use this to make a plot by piping directly to ggplot:

tidy_books %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  filter(n > 700) %>%
  ggplot(aes(word, n)) +
  geom_col()

bar chart showing count of word instances

We've added an extra step here to filter out the less-frequent words, and used a new column plot geom_col in ggplot to just plot the word counts n as bars.
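If the pipe still feels a bit mysterious, it can help to see the same calculation written both ways. This is a small sketch on the mpg dataset rather than the books (the cty > 20 filter and the names nested and piped are made up for the example): x %>% f(y) is just another way of writing f(x, y).

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Nested calls: read from the inside out
nested <- count(filter(mpg, cty > 20), manufacturer, sort = TRUE)

# Piped calls: read from top to bottom
piped <- mpg %>%
  filter(cty > 20) %>%
  count(manufacturer, sort = TRUE)

# Both approaches give the same table
all.equal(nested, piped)
```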

We'll make use of the gutenbergr package here too, which lets us download novels from the Project Gutenberg website directly into R. For example, to download Dickens' novel Oliver Twist, we would go to Project Gutenberg, search for the novel, then look at the unique ID for the novel in the URL at the top of the page. For Oliver Twist, this ID is 730. Install the gutenbergr package (see Section 1 for a reminder) and then:

library(gutenbergr)
## Warning: package 'gutenbergr' was built under R version 3.4.1
oliver_twist <- gutenberg_download(730)
oliver_twist
## # A tibble: 18,798 x 2
##    gutenberg_id                      text
##           <int>                     <chr>
##  1          730              OLIVER TWIST
##  2          730                          
##  3          730                        OR
##  4          730                          
##  5          730 THE PARISH BOY'S PROGRESS
##  6          730                          
##  7          730                          
##  8          730                        BY
##  9          730                          
## 10          730           CHARLES DICKENS
## # ... with 18,788 more rows

And you can collect multiple books by providing a vector of IDs. For example, to get both of Lewis Carroll's "Alice" novels, Alice's Adventures in Wonderland (ID 11) and Through the Looking-Glass (ID 12):

alice_books <- gutenberg_download(c(11,12))

We're now ready to have a lot of fun, visualising the most frequent words for different authors, as well as the relationships between different texts.





Reading Jane Austen novels into R

In this section, we are going to look at how to deal with variables that are more complicated than those discussed so far: natural language.

To start off, load some packages into R.

library(tidyverse)
library(tidytext)
library(janeaustenr)
library(stringr)

Don't forget to install the packages!

Get the Jane Austen books from the package janeaustenr and do some cleaning.

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup()

Notice those commands group_by, mutate, and ungroup? There's more information about them in Section 3: Manipulating and joining data. For now, know that these commands have created the following dataframe:

original_books
## # A tibble: 73,422 x 4
##                     text                book linenumber chapter
##                    <chr>              <fctr>      <int>   <int>
##  1 SENSE AND SENSIBILITY Sense & Sensibility          1       0
##  2                       Sense & Sensibility          2       0
##  3        by Jane Austen Sense & Sensibility          3       0
##  4                       Sense & Sensibility          4       0
##  5                (1811) Sense & Sensibility          5       0
##  6                       Sense & Sensibility          6       0
##  7                       Sense & Sensibility          7       0
##  8                       Sense & Sensibility          8       0
##  9                       Sense & Sensibility          9       0
## 10             CHAPTER 1 Sense & Sensibility         10       1
## # ... with 73,412 more rows
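The chapter column is the least obvious part of that code. It is a running count of the lines that look like chapter headings. Here is a small sketch of the idea on a made-up vector of lines (the vector lines and its contents are invented for this example):

```r
library(stringr)

# Made-up lines standing in for the text column of one book
lines <- c("SENSE AND SENSIBILITY",
           "CHAPTER 1",
           "The family of Dashwood",
           "CHAPTER 2",
           "Mrs. John Dashwood")

# TRUE where a line starts with "chapter" followed by a digit or a
# roman-numeral letter, ignoring upper/lower case
is_heading <- str_detect(lines, regex("^chapter [\\divxlc]", ignore_case = TRUE))

# Running total of headings seen so far, i.e. the chapter each line belongs to
chapter <- cumsum(is_heading)
chapter
```

Lines before the first heading get chapter 0, which is why the title lines in the dataframe above have chapter 0.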

Type the commands into R and then have a look at the data frame:

  1. What are the subjects?
  2. What are the variables?
  3. What type of variables are they?

We are now going to change the form of the dataframe using the unnest_tokens() function from the tidytext package.

tidy_books <- original_books %>%
  unnest_tokens(word, text)
tidy_books
## # A tibble: 725,055 x 4
##                   book linenumber chapter        word
##                 <fctr>      <int>   <int>       <chr>
##  1 Sense & Sensibility          1       0       sense
##  2 Sense & Sensibility          1       0         and
##  3 Sense & Sensibility          1       0 sensibility
##  4 Sense & Sensibility          3       0          by
##  5 Sense & Sensibility          3       0        jane
##  6 Sense & Sensibility          3       0      austen
##  7 Sense & Sensibility          5       0        1811
##  8 Sense & Sensibility         10       1     chapter
##  9 Sense & Sensibility         10       1           1
## 10 Sense & Sensibility         13       1         the
## # ... with 725,045 more rows

Now what are the subjects and variables?
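If it helps to see the change of form on something small, here is a sketch using a two-line toy dataframe (toy and toy_words are made-up names, and the two lines of text are copied from the top of the output above):

```r
library(dplyr)     # for tibble() and the pipe
library(tidytext)

# A toy dataframe with one row per line of text
toy <- tibble(linenumber = 1:2,
              text = c("SENSE AND SENSIBILITY", "by Jane Austen"))

# One row per word: the text column is split up and replaced by a word column
toy_words <- toy %>%
  unnest_tokens(word, text)

toy_words
```

Each word keeps the linenumber of the line it came from, and the words are converted to lower case along the way.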


Summarizing data: Mean, standard deviation etc.

For continuous variables, we can look at the mean and standard deviation.
Picking on the displacement column from the mpg dataset, take a wild guess at how you would calculate the mean and standard deviation.

You use the mean command to compute the mean, and sd for the standard deviation.

Here are the results:

mean(mpg$displ)
[1] 3.471795
sd(mpg$displ)
[1] 1.291959
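If you're curious what those two commands do behind the scenes, here is a sketch that computes both by hand. The names x, n, manual_mean and manual_sd are made up for the example. Note that sd uses the sample standard deviation, which divides by n - 1 rather than n.

```r
library(ggplot2)  # provides the mpg dataset

x <- mpg$displ
n <- length(x)

# Mean: add everything up and divide by the number of values
manual_mean <- sum(x) / n

# Sample standard deviation: square root of the summed squared
# deviations from the mean, divided by n - 1
manual_sd <- sqrt(sum((x - manual_mean)^2) / (n - 1))

# Both agree with the built-in commands
all.equal(manual_mean, mean(x))
all.equal(manual_sd, sd(x))
```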


Statisticians often talk about the five number summary, which describes a distribution of data in just five numbers. These are the sample minimum, or the smallest value, the first, or lower, quartile, the median, or middle, value, the third, or upper quartile, and the sample maximum, or the largest value.

R makes the five number summary incredibly simple.

You type summary, like this:

summary(mpg$displ)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.600   2.400   3.300   3.472   4.600   7.000 

Notice that summary is so good that you get a sixth number, the mean, for free. If you really want just the five numbers, use the fivenum command:

fivenum(mpg$displ)
[1] 1.6 2.4 3.3 4.6 7.0


This gives us a fairly good idea of what this data set looks like, but a picture tells a thousand words, and we can visualize the entire distribution by creating a histogram using ggplot.

If you type these commands, you'll get this picture:

ggplot(mpg, aes(displ)) +
  geom_histogram(col = "black") +
  theme(text = element_text(size = 30))
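When you run the histogram code, ggplot will pick a default of 30 bins and print a message suggesting you choose a better value. A minimal sketch of setting the bin width yourself is below. The value 0.5 (half a litre per bar) is just an assumption to get started, so try a few values and see how the picture changes.

```r
library(ggplot2)

# binwidth is in the units of the x variable: each bar covers 0.5 litres
p_hist <- ggplot(mpg, aes(displ)) +
  geom_histogram(binwidth = 0.5, col = "black")

p_hist
```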

Tables and barcharts in R

Now that you know a bit about subjects and variables, it's time for a deeper dive on summarizing different types of variables. Let's start with categorical variables: the appropriate way to summarise categorical variables is using tables and barcharts.

Looking again at the mpg dataset, a good guide is that the columns containing characters <chr> are categorical variables. Take a look at the first column, the manufacturer. How many of each brand of car are there?

One way to answer this is to make a table. Do you remember how to select columns? Making a table of counts of each type is not much more difficult:

table(mpg$manufacturer)
## 
##       audi  chevrolet      dodge       ford      honda    hyundai 
##         18         19         37         25          9         14 
##       jeep land rover    lincoln    mercury     nissan    pontiac 
##          8          4          3          4         13          5 
##     subaru     toyota volkswagen 
##         14         34         27

This shows you that there are 18 Audis in the dataset, 19 Chevrolets, and so on. Fine, but you might like to know the proportion of each type of car, and dividing by 234 isn't such a simple thing to do in your head (at least, not for everyone!). Luckily, you can pass the table to the R function prop.table to convert all these numbers into proportions:

prop.table(table(mpg$manufacturer))
## 
##       audi  chevrolet      dodge       ford      honda    hyundai 
## 0.07692308 0.08119658 0.15811966 0.10683761 0.03846154 0.05982906 
##       jeep land rover    lincoln    mercury     nissan    pontiac 
## 0.03418803 0.01709402 0.01282051 0.01709402 0.05555556 0.02136752 
##     subaru     toyota volkswagen 
## 0.05982906 0.14529915 0.11538462

So, now you know that about 15.8% of the cars are Dodges, and 10.7% are Fords. It might be nicer still to represent this information as a bar chart, so you don't have to read all those numbers. This is where you turn to your newest friend, the ggplot package, which will become our constant companion over the next few sections. To create a barchart, type the command

ggplot(mpg, aes(manufacturer)) +
  geom_bar() +
  theme(text = element_text(size = 30), axis.text.x = element_text(angle = 90))

count of manufacturer bar chart
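By default the bars appear in alphabetical order. If you'd rather see the most common manufacturers first, one option is the fct_infreq function from the forcats package (installed with the tidyverse), which orders the categories by how often they occur. This is a sketch of one way to do it, not the only one, and p_sorted is just a name for the example.

```r
library(ggplot2)
library(forcats)

# fct_infreq() reorders the manufacturers from most to least frequent
p_sorted <- ggplot(mpg, aes(fct_infreq(manufacturer))) +
  geom_bar() +
  labs(x = "manufacturer") +                        # tidy up the axis label
  theme(axis.text.x = element_text(angle = 90))     # rotate the brand names

p_sorted
```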