Technology & Project Management tips and tricks: Summarizing data: Mean, standard deviation etc.

Wednesday, July 29, 2020

Summarizing data: Mean, standard deviation etc.

For continuous variables, we can look at the mean and standard deviation.
Taking the displacement column (displ) from the mpg dataset, take a wild guess at how you would calculate the mean and standard deviation.

You use the mean function to compute the mean and the sd function for the standard deviation. R is case-sensitive, so it has to be sd in lowercase.

Here are the results (mpg ships with the ggplot2 package, so load that first):

library(ggplot2)
mean(mpg$displ)
[1] 3.471795
sd(mpg$displ)
[1] 1.291959
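
To see what these two functions compute, here is a minimal sketch that works them out by hand. The vector x is made up purely for illustration and is not part of the mpg data. Note that sd uses the sample formula, dividing by n - 1 rather than n.

```r
# Illustrative values only (an assumption, not taken from mpg)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Mean: the sum of the values divided by how many there are
manual_mean <- sum(x) / n

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by n - 1
manual_sd <- sqrt(sum((x - manual_mean)^2) / (n - 1))

# Both comparisons should report TRUE
all.equal(manual_mean, mean(x))
all.equal(manual_sd, sd(x))
```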

Statisticians often talk about the five number summary, which describes a distribution of data in just five numbers: the sample minimum (the smallest value), the first or lower quartile, the median (the middle value), the third or upper quartile, and the sample maximum (the largest value).

R makes the five number summary incredibly simple.

You type 'summary', like this:

summary(mpg$displ)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.600   2.400   3.300   3.472   4.600   7.000

Notice that summary is so good that you get a sixth number, the mean, for free.
If you really want just the five numbers, use the fivenum function:

fivenum(mpg$displ)
[1] 1.6 2.4 3.3 4.6 7.0
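
One thing worth knowing: summary gets its quartiles from quantile, while fivenum reports Tukey's hinges. For displ the two agree, but on small samples they can differ. A minimal sketch, using a made-up four-value vector (an assumption for illustration only):

```r
# Illustrative values only (not from mpg)
x <- c(1, 2, 3, 4)

# Quartiles as summary() computes them (quantile's default method)
q <- unname(quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)))

# Tukey's five number summary: min, lower hinge, median, upper hinge, max
f <- fivenum(x)

print(q)
print(f)

# The minimum, median and maximum match; the second and fourth values do not
print(q == f)
```

If the exact quartile definition matters for your work, check which of the two you are reporting.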

This gives us a fairly good idea of what this data set looks like, but a picture is worth a thousand words, and we can visualize the entire distribution by creating a histogram using ggplot.

These commands draw the histogram:

ggplot(mpg, aes(displ)) +
  geom_histogram(col = "black") +
  theme(text = element_text(size = 30))
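
When no bin settings are given, geom_histogram falls back to 30 bins and prints a message suggesting you pick a better value. A minimal sketch that sets the bin width explicitly; the width of 0.5 is an arbitrary choice for illustration, not a recommendation from the original post:

```r
library(ggplot2)  # provides ggplot() and the mpg dataset

# Same histogram as before, but with an explicit bin width
# (0.5 is an assumed value; try a few and compare the shapes)
p <- ggplot(mpg, aes(displ)) +
  geom_histogram(binwidth = 0.5, col = "black") +
  theme(text = element_text(size = 30))

print(p)
```

A smaller binwidth shows more detail but makes the plot noisier; a larger one smooths the shape at the cost of hiding gaps in the data.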
