Wednesday, July 29, 2020

Reading Jane Austen novels into R

In this section, we look at how to handle data that is more complicated than the variables discussed so far: natural language.

To start off, load some packages into R.

library(tidyverse)
library(tidytext)
library(janeaustenr)
library(stringr)

If R reports that a package is missing, install it first with install.packages() and then run the library() calls again.

Get the Jane Austen books from the package janeaustenr and do some cleaning.

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup()

Notice the commands group_by, mutate, and ungroup? There's more information about them in Section 3: Manipulating and joining data. For now, it is enough to know that these commands have created the following data frame, which adds a line number and a running chapter number within each book:

original_books
## # A tibble: 73,422 x 4
##                     text                book linenumber chapter
##                    <chr>              <fctr>      <int>   <int>
##  1 SENSE AND SENSIBILITY Sense & Sensibility          1       0
##  2                       Sense & Sensibility          2       0
##  3        by Jane Austen Sense & Sensibility          3       0
##  4                       Sense & Sensibility          4       0
##  5                (1811) Sense & Sensibility          5       0
##  6                       Sense & Sensibility          6       0
##  7                       Sense & Sensibility          7       0
##  8                       Sense & Sensibility          8       0
##  9                       Sense & Sensibility          9       0
## 10             CHAPTER 1 Sense & Sensibility         10       1
## # ... with 73,412 more rows
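
The chapter column can look a little mysterious, so here is a minimal sketch of the same trick on a handful of made-up lines (the vector toy_lines is invented for illustration and is not taken from the novels). str_detect() marks each line that looks like a chapter heading, and cumsum() turns those marks into a running chapter count.

```r
library(stringr)

# Made-up lines for illustration only (not from janeaustenr)
toy_lines <- c("TITLE", "", "CHAPTER 1", "Some text.", "Chapter II", "More text.")

# TRUE where a line starts with "chapter " followed by a digit
# or one of the roman-numeral letters i, v, x, l, c (case ignored)
is_heading <- str_detect(toy_lines,
                         regex("^chapter [\\divxlc]", ignore_case = TRUE))

# Running total of the TRUEs: every heading increases the counter by one,
# and lines before the first heading stay at chapter 0
toy_chapter <- cumsum(is_heading)
toy_chapter
## [1] 0 0 1 1 2 2
```

This is why the title lines at the top of each book have chapter 0 in the output above.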

Type the commands into R and then have a look at the data frame:

  1. What are the subjects?
  2. What are the variables?
  3. What type of variables are they?

We are now going to reshape the data frame with the unnest_tokens() function from the tidytext package. It splits the text column into one word per row; by default it also converts the words to lower case and strips punctuation.

tidy_books <- original_books %>%
  unnest_tokens(word, text)
tidy_books
## # A tibble: 725,055 x 4
##                   book linenumber chapter        word
##                 <fctr>      <int>   <int>       <chr>
##  1 Sense & Sensibility          1       0       sense
##  2 Sense & Sensibility          1       0         and
##  3 Sense & Sensibility          1       0 sensibility
##  4 Sense & Sensibility          3       0          by
##  5 Sense & Sensibility          3       0        jane
##  6 Sense & Sensibility          3       0      austen
##  7 Sense & Sensibility          5       0        1811
##  8 Sense & Sensibility         10       1     chapter
##  9 Sense & Sensibility         10       1           1
## 10 Sense & Sensibility         13       1         the
## # ... with 725,045 more rows

Now what are the subjects and variables?
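
To see the reshaping on a small scale, here is a sketch that applies unnest_tokens() to a tiny hand-made tibble (toy is a hypothetical example that imitates the first three lines of original_books; it is not part of any package).

```r
library(tibble)
library(tidytext)

# Hand-made miniature of original_books, for illustration only
toy <- tibble(
  linenumber = 1:3,
  text = c("SENSE AND SENSIBILITY", "", "by Jane Austen")
)

# One row per word: the new column is called word, the input column is text
toy_words <- unnest_tokens(toy, word, text)
toy_words
```

Each row of toy_words now holds a single lower-case word together with the line number it came from. The empty line contributes no words, which matches the gaps in linenumber in the tidy_books output.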

