In this section, we look at how to work with data that is more complicated than the variables discussed so far: natural language text.
To start off, load some packages into R.
library(tidyverse)
library(tidytext)
library(janeaustenr)
library(stringr)
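Don't forget to install the packages first! If one is missing, install it with install.packages(), for example install.packages("tidytext").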
Get the Jane Austen books from the janeaustenr package and do some cleaning: number the lines within each book and work out which chapter each line belongs to.
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup()
Notice the commands group_by, mutate, and ungroup? There's more information about them in Section 3: Manipulating and joining data. For now, know that these commands have created the following data frame:
original_books
## # A tibble: 73,422 x 4
##                     text                book linenumber chapter
##                    <chr>              <fctr>      <int>   <int>
##  1 SENSE AND SENSIBILITY Sense & Sensibility          1       0
##  2                       Sense & Sensibility          2       0
##  3        by Jane Austen Sense & Sensibility          3       0
##  4                       Sense & Sensibility          4       0
##  5                (1811) Sense & Sensibility          5       0
##  6                       Sense & Sensibility          6       0
##  7                       Sense & Sensibility          7       0
##  8                       Sense & Sensibility          8       0
##  9                       Sense & Sensibility          9       0
## 10             CHAPTER 1 Sense & Sensibility         10       1
## # ... with 73,412 more rows
Type the commands into R and then have a look at the data frame:
- What are the subjects?
- What are the variables?
- What type of variables are they?
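The chapter column in the cleaning step may be easier to follow on a small example. The sketch below uses a made-up vector, toy_text (an assumption for illustration, not part of the Austen data), and applies the same str_detect() and cumsum() calls: each line that looks like a chapter heading is flagged as TRUE, and the running total of those flags gives the chapter number.

```r
library(stringr)

# Hypothetical toy lines standing in for the text column (not from the books)
toy_text <- c("A TITLE", "", "CHAPTER 1", "Some text.", "Chapter II", "More text.")

# TRUE where a line starts with "chapter" followed by a digit or a
# roman-numeral letter (i, v, x, l, c), ignoring case
is_heading <- str_detect(toy_text, regex("^chapter [\\divxlc]", ignore_case = TRUE))

# cumsum() counts TRUE as 1 and FALSE as 0, so the running total goes up by
# one at every heading and stays constant for the lines in between
chapter <- cumsum(is_heading)

data.frame(text = toy_text, is_heading, chapter)
```

Lines before the first heading get chapter 0, which is why the title lines of each book show chapter 0 in original_books.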
We are now going to change the form of the data frame using the unnest_tokens() function from the tidytext package. It splits the text column into individual words, giving one word per row.
tidy_books <- original_books %>%
  unnest_tokens(word, text)
tidy_books
## # A tibble: 725,055 x 4
##                   book linenumber chapter        word
##                 <fctr>      <int>   <int>       <chr>
##  1 Sense & Sensibility          1       0       sense
##  2 Sense & Sensibility          1       0         and
##  3 Sense & Sensibility          1       0 sensibility
##  4 Sense & Sensibility          3       0          by
##  5 Sense & Sensibility          3       0        jane
##  6 Sense & Sensibility          3       0      austen
##  7 Sense & Sensibility          5       0        1811
##  8 Sense & Sensibility         10       1     chapter
##  9 Sense & Sensibility         10       1           1
## 10 Sense & Sensibility         13       1         the
## # ... with 725,045 more rows
Now what are the subjects and variables?
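To see what unnest_tokens() does on a scale small enough to check by eye, here is a minimal sketch on a three-row data frame. The name toy_lines and its contents are assumptions for illustration; the three strings are copied from the first non-empty lines shown above.

```r
library(dplyr)
library(tidytext)

# Hypothetical three-line data frame; the text is taken from the output above
toy_lines <- tibble(
  linenumber = c(1, 3, 5),
  text = c("SENSE AND SENSIBILITY", "by Jane Austen", "(1811)")
)

# First argument after the data: name of the new column (word)
# Second argument: the existing column to split into tokens (text)
toy_words <- toy_lines %>%
  unnest_tokens(word, text)

toy_words
```

With the default settings, the words come back in lower case and with punctuation removed, and the other columns (here linenumber) are repeated for every word that came from the same line.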