Wednesday, July 29, 2020

Summarizing data: Mean, standard deviation etc.

Summarizing data: Mean, standard deviation etc.

For continuous variables, we can look at the mean and standard deviation.
Picking on the displacement column from the MPG dataset, take a wild guess at how you would calculate the mean and standard deviation.

You use the mean command to compute the mean and SD for the standard deviation.

Here are the results

mean(mpg$displ)
[1] 3.471795
sd(mpg$displ) [1] 1.291959


Statisticians often talk about the five number summary, which describes a distribution of data in just five numbers. These are the sample minimum, or the smallest value, the first, or lower, quartile, the median, or middle, value, the third, or upper quartile, and the sample maximum, or the largest value.

R makes the five number summary incredibly simple.

You type 'summary', like this
summary((mpg$displ))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.600   2.400   3.300   3.472   4.600   7.000 

Notice that summary is so good that you get a sixth number, the mean, for free.
If you really want just the five numbers, use the fivenum command

fivenum(mpg$displ)
[1] 1.6 2.4 3.3 4.6 7.0


This gives us a fairly good idea of what this data set looks like, but a picture tells a thousand words
and we can visualize the entire distribution by creating a histogram using ggplot.

If you type these commands, you'll get this picture
ggplot(mpg,aes(displ))+
geom_histogram(col="black")+
theme(text=element_text(size=30))

Tables and barcharts in R

Now that you know a bit about subjects and variables, it's time for a deeper dive on summarizing different types of variables. Let's start with categorical variables: the appropriate way to summaries categorical variables is using tables and barcharts.

Looking again at the mpg dataset, a good guide is that the columns containing characters <chr> are categorical variables. Take a look at the first column, the manufacturer. How many of each brand of car are there?

One way to answer this is to make a table. Do you remember how to select columns? Making a table of counts of each type is not much more difficult:

table(mpg$manufacturer)
## 
##       audi  chevrolet      dodge       ford      honda    hyundai 
##         18         19         37         25          9         14 
##       jeep land rover    lincoln    mercury     nissan    pontiac 
##          8          4          3          4         13          5 
##     subaru     toyota volkswagen 
##         14         34         27

This shows you that there are 18 Audis in the dataset, 19 Chevrolets, and so on. Fine, but you might like to know the proportion of each type of car, and dividing by 234 isn't such a simple thing to do in your head (at least, not for everyone!). Luckily, you can pass the table to the R function prop.table to convert all these numbers into proportions:

prop.table(table(mpg$manufacturer))
## 
##       audi  chevrolet      dodge       ford      honda    hyundai 
## 0.07692308 0.08119658 0.15811966 0.10683761 0.03846154 0.05982906 
##       jeep land rover    lincoln    mercury     nissan    pontiac 
## 0.03418803 0.01709402 0.01282051 0.01709402 0.05555556 0.02136752 
##     subaru     toyota volkswagen 
## 0.05982906 0.14529915 0.11538462

So, now you know that about 15.8% of the cars are Dodges, and 10.7% are Fords. It might be nicer still to represent this information as a bar chart, so you don't have to read all those numbers. This is where you turn to your newest friend, the ggplot package, which will become our constant companion over the next few sections. To create a barchart, type the command

ggplot(mpg,aes(manufacturer)) +
geom_bar() +
theme(text = element_text(size = 30), axis.text.x = element_text(angle = 90))

count of manufacturer bar chart


Subjects and variables

Defining subjects and variables: Quantitative (discrete, continuous) vs Categorical (ordinal, nominal)


If you look at the mpg dataset, you'll notice a standard way of representing data in R and most standard statistical packages. Each row is a subject, and each column is a variable.

Table of MPG dataset

Subject means the smallest object or entity that you measure. In the mpg dataset, this is types of cars, and each row is a different type of car.

The things that you measure are called variables. So for the first car, the manufacturer is Audi and the model is a4. You refer to the manufacturer and model as variables. In statistics, variables are classified into four main types:

  • categorical ordinal,
  • categorical nominal,
  • quantitative continuous, and
  • quantitative discrete.

Categorical variables are things that can be classified with labels. Categorical ordinal are labels that have an order, for example the bronze, silver and gold medals in the Olympics, while categorical nominal are labels that do not have an order. In the mpg dataset, manufacturer is a categorical nominal variable, while model may be a categorical ordinal as models are often ordered according to price.

Quantitative variables are things that are measured using numbers. Quantitative continuous variables can take any numerical value, including fractions and decimals. For example: time, temperature, length and weight, and anything that is derived from them. On the other hand, quantitative discrete variables are things that are counted, so only take whole-number values. For example you may count the number of people infected with a disease, or the number of cars that cross a bridge.

The main difference is that a sufficiently accurate measuring device can measure quantitative continuous random variable to any value in a range, while discrete will always have gaps between the numbers, for example you can’t have 3.5 people infected with a disease.

In the mpg dataset, the variable cyl stands for the number of cylinders in the car. As cylinders are countable (rather than measured continuously), this variable is a quantitative discrete variable. The variable displ stands for an engine's displacement, which is a volume (in litres). Therefore, being something that is measured, it is a quantitative continuous variable.

Notice that as displacement is measured to the nearest 0.1 of a litre, this variable ends up being discretised. However, it is still considered to be a continuous variable, as theoretically, volume could be recorded to any number of decimal places. On the other hand, you can never have half a cylinder, so the cyl variable will always be quantitative discrete.

So why do we care about this? Because if we know the type of variable, then we know how to deal with it statistically.



Introduction to Dataframes in R

Looking at dataframes in R

Dataframes are the fundamental data structure in R. They are essentially tables consisting of variables and observations. If you can represent your dataset as a dataframe, you're ready to start answering questions about it using the power of R. Let's take a look at a dataframe.

Load the mpg dataframe contained in the tidyverse package:

library(tidyverse)
mpg

## # A tibble: 234 x 11
##    manufacturer      model displ  year   cyl      trans   drv   cty   hwy
##           <chr>      <chr> <dbl> <int> <int>      <chr> <chr> <int> <int>
##  1         audi         a4   1.8  1999     4   auto(l5)     f    18    29
##  2         audi         a4   1.8  1999     4 manual(m5)     f    21    29
##  3         audi         a4   2.0  2008     4 manual(m6)     f    20    31
##  4         audi         a4   2.0  2008     4   auto(av)     f    21    30
##  5         audi         a4   2.8  1999     6   auto(l5)     f    16    26
##  6         audi         a4   2.8  1999     6 manual(m5)     f    18    26
##  7         audi         a4   3.1  2008     6   auto(av)     f    18    27
##  8         audi a4 quattro   1.8  1999     4 manual(m5)     4    18    26
##  9         audi a4 quattro   1.8  1999     4   auto(l5)     4    16    25
## 10         audi a4 quattro   2.0  2008     4 manual(m6)     4    20    28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>


Look at all that data! This dataframe contains data about the fuel economy of cars, collected between 1999 and 2008. Specifically, it's a table containing 234 rows, and 11 columns. It's called a "tibble", which is just a slightly modernised version of R's original data.frame. We'll come back to the mpg dataset in Subjects and Variables.

You can create your own tibbles:

mytibble <- tibble(
  x = 1:4, 
  y = x^2, 
  z = y + 0.1
)
mytibble
## # A tibble: 4 x 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1     1   1.1
## 2     2     4   4.1
## 3     3     9   9.1
## 4     4    16  16.1

And also create "tribbles": not the troublesome kind of Tribbles, but "transposed tibbles":

mytribble <- tribble(
  ~x, ~y, ~z,
  1, 4.2,"a",
  3, 9.6,"b",
  4,16.8,"c"  
)
mytribble
## # A tibble: 3 x 3
##       x     y     z
##   <dbl> <dbl> <chr>
## 1     1   4.2     a
## 2     3   9.6     b
## 3     4  16.8     c

...which might be easier, depending on your situation.

Great! So, if you have done what I have been showing you here, you should now be able to load dataframes, and even build your own dataframes!

Getting variables from a dataframe

You might want to look at only part of a dataframe. For example, I noticed the names of a number of models of cars in the mpg dataset, so let's focus in on them. You can select a single column using the $ symbol:

mpg$model
##   [1] "a4"                     "a4"                    
##   [3] "a4"                     "a4"                    
##   [5] "a4"                     "a4"                    
##   [7] "a4"                     "a4 quattro"            
##   [9] "a4 quattro"             "a4 quattro"            
##  [11] "a4 quattro"             "a4 quattro"            
##  [13] "a4 quattro"             "a4 quattro"            
##  [15] "a4 quattro"             "a6 quattro"            
##  [17] "a6 quattro"             "a6 quattro"            
##  [19] "c1500 suburban 2wd"     "c1500 suburban 2wd"    
##  [21] "c1500 suburban 2wd"     "c1500 suburban 2wd"    
##  [23] "c1500 suburban 2wd"     "corvette"              
##  [25] "corvette"               "corvette"              
##  [27] "corvette"               "corvette"              
##  [29] "k1500 tahoe 4wd"        "k1500 tahoe 4wd"       
##  [31] "k1500 tahoe 4wd"        "k1500 tahoe 4wd"       
##  [33] "malibu"                 "malibu"                
##  [35] "malibu"                 "malibu"                
##  [37] "malibu"                 "caravan 2wd"           
##  [39] "caravan 2wd"            "caravan 2wd"           
##  [41] "caravan 2wd"            "caravan 2wd"           
##  [43] "caravan 2wd"            "caravan 2wd"           
##  [45] "caravan 2wd"            "caravan 2wd"           
##  [47] "caravan 2wd"            "caravan 2wd"           
##  [49] "dakota pickup 4wd"      "dakota pickup 4wd"     
##  [51] "dakota pickup 4wd"      "dakota pickup 4wd"     
##  [53] "dakota pickup 4wd"      "dakota pickup 4wd"     
##  [55] "dakota pickup 4wd"      "dakota pickup 4wd"     
##  [57] "dakota pickup 4wd"      "durango 4wd"           
##  [59] "durango 4wd"            "durango 4wd"           
##  [61] "durango 4wd"            "durango 4wd"           
##  [63] "durango 4wd"            "durango 4wd"           
##  [65] "ram 1500 pickup 4wd"    "ram 1500 pickup 4wd"   
##  [67] "ram 1500 pickup 4wd"    "ram 1500 pickup 4wd"   
##  [69] "ram 1500 pickup 4wd"    "ram 1500 pickup 4wd"   
##  [71] "ram 1500 pickup 4wd"    "ram 1500 pickup 4wd"   
##  [73] "ram 1500 pickup 4wd"    "ram 1500 pickup 4wd"   
##  [75] "expedition 2wd"         "expedition 2wd"        
##  [77] "expedition 2wd"         "explorer 4wd"          
##  [79] "explorer 4wd"           "explorer 4wd"          
##  [81] "explorer 4wd"           "explorer 4wd"          
##  [83] "explorer 4wd"           "f150 pickup 4wd"       
##  [85] "f150 pickup 4wd"        "f150 pickup 4wd"       
##  [87] "f150 pickup 4wd"        "f150 pickup 4wd"       
##  [89] "f150 pickup 4wd"        "f150 pickup 4wd"       
##  [91] "mustang"                "mustang"               
##  [93] "mustang"                "mustang"               
##  [95] "mustang"                "mustang"               
##  [97] "mustang"                "mustang"               
##  [99] "mustang"                "civic"                 
## [101] "civic"                  "civic"                 
## [103] "civic"                  "civic"                 
## [105] "civic"                  "civic"                 
## [107] "civic"                  "civic"                 
## [109] "sonata"                 "sonata"                
## [111] "sonata"                 "sonata"                
## [113] "sonata"                 "sonata"                
## [115] "sonata"                 "tiburon"               
## [117] "tiburon"                "tiburon"               
## [119] "tiburon"                "tiburon"               
## [121] "tiburon"                "tiburon"               
## [123] "grand cherokee 4wd"     "grand cherokee 4wd"    
## [125] "grand cherokee 4wd"     "grand cherokee 4wd"    
## [127] "grand cherokee 4wd"     "grand cherokee 4wd"    
## [129] "grand cherokee 4wd"     "grand cherokee 4wd"    
## [131] "range rover"            "range rover"           
## [133] "range rover"            "range rover"           
## [135] "navigator 2wd"          "navigator 2wd"         
## [137] "navigator 2wd"          "mountaineer 4wd"       
## [139] "mountaineer 4wd"        "mountaineer 4wd"       
## [141] "mountaineer 4wd"        "altima"                
## [143] "altima"                 "altima"                
## [145] "altima"                 "altima"                
## [147] "altima"                 "maxima"                
## [149] "maxima"                 "maxima"                
## [151] "pathfinder 4wd"         "pathfinder 4wd"        
## [153] "pathfinder 4wd"         "pathfinder 4wd"        
## [155] "grand prix"             "grand prix"            
## [157] "grand prix"             "grand prix"            
## [159] "grand prix"             "forester awd"          
## [161] "forester awd"           "forester awd"          
## [163] "forester awd"           "forester awd"          
## [165] "forester awd"           "impreza awd"           
## [167] "impreza awd"            "impreza awd"           
## [169] "impreza awd"            "impreza awd"           
## [171] "impreza awd"            "impreza awd"           
## [173] "impreza awd"            "4runner 4wd"           
## [175] "4runner 4wd"            "4runner 4wd"           
## [177] "4runner 4wd"            "4runner 4wd"           
## [179] "4runner 4wd"            "camry"                 
## [181] "camry"                  "camry"                 
## [183] "camry"                  "camry"                 
## [185] "camry"                  "camry"                 
## [187] "camry solara"           "camry solara"          
## [189] "camry solara"           "camry solara"          
## [191] "camry solara"           "camry solara"          
## [193] "camry solara"           "corolla"               
## [195] "corolla"                "corolla"               
## [197] "corolla"                "corolla"               
## [199] "land cruiser wagon 4wd" "land cruiser wagon 4wd"
## [201] "toyota tacoma 4wd"      "toyota tacoma 4wd"     
## [203] "toyota tacoma 4wd"      "toyota tacoma 4wd"     
## [205] "toyota tacoma 4wd"      "toyota tacoma 4wd"     
## [207] "toyota tacoma 4wd"      "gti"                   
## [209] "gti"                    "gti"                   
## [211] "gti"                    "gti"                   
## [213] "jetta"                  "jetta"                 
## [215] "jetta"                  "jetta"                 
## [217] "jetta"                  "jetta"                 
## [219] "jetta"                  "jetta"                 
## [221] "jetta"                  "new beetle"            
## [223] "new beetle"             "new beetle"            
## [225] "new beetle"             "new beetle"            
## [227] "new beetle"             "passat"                
## [229] "passat"                 "passat"                
## [231] "passat"                 "passat"                
## [233] "passat"                 "passat"

Do you notice how when you type mpg$, a list of the column names pops up? You can navigate this list using the arrow keys, or even start typing part of a column name to search the list. For example, as the model column is the only column of mpg containing an "o", I could have typed mpg$, then o, then hit enter, and ended up with the same as above. Neat!

Note: You can get the same result as above by typing mpg[['model']], but you don't get the auto-complete in RStudio compared to the first version.

Anyway, that was a lot of car model names (234 to be precise - the same as the number of rows of the dataframe). What if we want to see how many different types of car there are in the dataset? Let's try putting the word unique in front of our previous command...

unique(mpg$model)
##  [1] "a4"                     "a4 quattro"            
##  [3] "a6 quattro"             "c1500 suburban 2wd"    
##  [5] "corvette"               "k1500 tahoe 4wd"       
##  [7] "malibu"                 "caravan 2wd"           
##  [9] "dakota pickup 4wd"      "durango 4wd"           
## [11] "ram 1500 pickup 4wd"    "expedition 2wd"        
## [13] "explorer 4wd"           "f150 pickup 4wd"       
## [15] "mustang"                "civic"                 
## [17] "sonata"                 "tiburon"               
## [19] "grand cherokee 4wd"     "range rover"           
## [21] "navigator 2wd"          "mountaineer 4wd"       
## [23] "altima"                 "maxima"                
## [25] "pathfinder 4wd"         "grand prix"            
## [27] "forester awd"           "impreza awd"           
## [29] "4runner 4wd"            "camry"                 
## [31] "camry solara"           "corolla"               
## [33] "land cruiser wagon 4wd" "toyota tacoma 4wd"     
## [35] "gti"                    "jetta"                 
## [37] "new beetle"             "passat"

It works! There are 38 different types of cars in the dataset.

If you want to select rows rather than columns from the dataframe, use square brackets [], for example:

mpg[194,]
## # A tibble: 1 x 11
##   manufacturer   model displ  year   cyl    trans   drv   cty   hwy    fl
##          <chr>   <chr> <dbl> <int> <int>    <chr> <chr> <int> <int> <chr>
## 1       toyota corolla   1.8  1999     4 auto(l3)     f    24    30     r
## # ... with 1 more variables: class <chr>

If you want a particular entry in the dataframe, provide a column number as well:

mpg[194,2]
## # A tibble: 1 x 1
##     model
##     <chr>
## 1 corolla

And, of course, you can also look at a range of rows and columns as well:

mpg[194:198,1:4]
## # A tibble: 5 x 4
##   manufacturer   model displ  year
##          <chr>   <chr> <dbl> <int>
## 1       toyota corolla   1.8  1999
## 2       toyota corolla   1.8  1999
## 3       toyota corolla   1.8  1999
## 4       toyota corolla   1.8  2008
## 5       toyota corolla   1.8  2008

You should now be able to find your way around dataframes. If you are having trouble, you can always go over this section again until you are familiar with what to do.









Saturday, July 25, 2020

Kampala City Tour | Uganda | Vlog-1