Wednesday, July 29, 2020

Summarizing data: Mean, standard deviation etc.

For continuous variables, we can look at the mean and standard deviation.
Taking the displacement column (displ) from the mpg dataset, which ships with ggplot2, take a wild guess at how you would calculate the mean and standard deviation.

You use the mean function to compute the mean and the sd function to compute the standard deviation.

Here are the results:

mean(mpg$displ)
[1] 3.471795
sd(mpg$displ)
[1] 1.291959
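To see what these two functions compute, here is a minimal sketch that works both out by hand. The vector x is made-up illustration data, not taken from mpg. Note that sd divides by n - 1 (the sample standard deviation), not by n.

```r
# Made-up illustration data (an assumption, not from the mpg dataset)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Mean: sum of the values divided by how many there are
manual_mean <- sum(x) / n

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by n - 1
manual_sd <- sqrt(sum((x - manual_mean)^2) / (n - 1))

manual_mean  # 5, same as mean(x)
manual_sd    # about 2.138, same as sd(x)
```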


Statisticians often talk about the five-number summary, which describes a distribution of data in just five numbers: the sample minimum (the smallest value), the first (lower) quartile, the median (the middle value), the third (upper) quartile, and the sample maximum (the largest value).

R makes the five-number summary incredibly simple.

You call summary, like this:
summary(mpg$displ)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.600   2.400   3.300   3.472   4.600   7.000 

Notice that summary is so good that you get a sixth number, the mean, for free.
If you really want just the five numbers, use the fivenum function:

fivenum(mpg$displ)
[1] 1.6 2.4 3.3 4.6 7.0
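For displ the two functions agree on the quartiles, but that is not guaranteed in general: fivenum reports Tukey's hinges, while summary relies on quantile, which interpolates by default. A small sketch with a made-up vector (an assumption, chosen only to make the difference visible) shows the two can disagree.

```r
# Made-up illustration data (an assumption, not from the mpg dataset)
x <- c(1, 2, 3, 4)

# Tukey's five-number summary: the hinges are medians of the lower and upper halves
tukey <- fivenum(x)

# Default quantiles (0%, 25%, 50%, 75%, 100%), which is what summary() uses
quart <- unname(quantile(x))

tukey  # 1.00 1.50 2.50 3.50 4.00
quart  # 1.00 1.75 2.50 3.25 4.00
```

On larger samples the gap is usually small, but it is worth knowing which definition you are reporting.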


This gives us a fairly good idea of what this data set looks like, but a picture tells a thousand words,
and we can visualize the entire distribution by creating a histogram using ggplot.

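If you type these commands, you'll get a histogram of displ:
library(ggplot2)  # provides ggplot() and the mpg dataset
ggplot(mpg, aes(displ)) +
  geom_histogram(col = "black") +        # outline each bar in black
  theme(text = element_text(size = 30))  # enlarge all text in the plot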
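To make concrete what a histogram tallies, here is a minimal base-R sketch that counts values per bin without drawing anything. The vector x and the break points are made up for illustration. geom_histogram picks its own bins (30 by default), so its bars for mpg$displ will not match these counts.

```r
# Made-up illustration data (an assumption, not from the mpg dataset)
x <- c(1.6, 1.8, 1.9, 2.4, 2.5, 3.3, 4.6, 5.7, 6.5)

# Bin edges at 1, 2, ..., 7 give six bins of width 1
h <- hist(x, breaks = 1:7, plot = FALSE)

h$counts  # 3 2 1 1 1 1, the height of each bar
h$mids    # 1.5 2.5 3.5 4.5 5.5 6.5, the centre of each bin
```

Changing the breaks changes the shape of the picture, which is why it is worth trying a few bin widths before reading too much into a histogram.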
