Summarizing data: Mean, standard deviation etc.
For continuous variables, we can look at the mean and standard deviation.
Picking on the displacement column from the MPG dataset, take a wild guess at how you would calculate the mean and standard deviation.
You use the mean command to compute the mean and the sd command (lowercase, because R is case-sensitive) to compute the standard deviation.
Here are the results:
mean(mpg$displ)
[1] 3.471795
sd(mpg$displ)
[1] 1.291959
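To see what these two commands compute, here is a minimal sketch on a small made-up vector (the values are illustrative and are not taken from mpg). It assumes the usual definitions: the mean is the sum divided by the count, and sd uses the sample formula with n - 1 in the denominator.

```r
# Hypothetical values for illustration only; not the mpg data
x <- c(1.8, 2.0, 2.8, 3.1, 4.6)

n <- length(x)

# Mean: sum of the values divided by how many there are
manual_mean <- sum(x) / n

# Sample standard deviation: note the n - 1 in the denominator
manual_sd <- sqrt(sum((x - manual_mean)^2) / (n - 1))

# Compare the hand-rolled versions with the built-in commands
all.equal(manual_mean, mean(x))
all.equal(manual_sd, sd(x))
```

Both comparisons return TRUE, which shows that sd reports the sample standard deviation rather than the population one.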
Statisticians often talk about the five-number summary, which describes a distribution of data in just five numbers: the sample minimum (the smallest value), the first (lower) quartile, the median (the middle value), the third (upper) quartile, and the sample maximum (the largest value).
R makes the five number summary incredibly simple.
You type 'summary', like this:
summary(mpg$displ)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.600   2.400   3.300   3.472   4.600   7.000
Notice that summary is so good that you get a sixth number, the mean, for free.
If you really want just the five numbers, use the fivenum command
fivenum(mpg$displ)
[1] 1.6 2.4 3.3 4.6 7.0
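One detail worth knowing: fivenum and summary do not compute the quartiles in quite the same way. fivenum uses Tukey's hinges, while summary relies on quantile with its default method. For mpg$displ the two agree, as the output above shows, but on small samples they can differ. Here is a minimal sketch on a made-up four-value vector (illustrative values, not from mpg):

```r
# Hypothetical values for illustration only
x <- c(1, 2, 3, 4)

# Tukey's hinges: the quartiles are medians of the lower and upper halves
fivenum(x)

# Default quantile method (type 7), which summary() also uses
quantile(x)

# summary() returns a named object, so single values can be pulled out by name
s <- summary(x)
s[["Median"]]
```

On this vector the lower and upper quartiles come out as 1.5 and 3.5 from fivenum, but 1.75 and 3.25 from quantile.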
This gives us a fairly good idea of what this data set looks like, but a picture is worth a thousand words, and we can visualize the entire distribution by creating a histogram using ggplot.
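If you type these commands, you'll get this picture:
library(ggplot2)  # provides ggplot() and the mpg dataset
ggplot(mpg, aes(displ)) +
  geom_histogram(col = "black") +        # black outline around each bar
  theme(text = element_text(size = 30))  # enlarge all text in the plot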
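When geom_histogram is called without a bin setting, ggplot2 falls back to 30 bins and prints a message suggesting that you pick a better value. Here is a minimal sketch of setting the bin width explicitly; the width of 0.5 is an arbitrary choice for illustration, not a value from the example above.

```r
library(ggplot2)  # provides ggplot() and the mpg dataset

# binwidth = 0.5 is an assumed, illustrative value; try a few and compare
p <- ggplot(mpg, aes(displ)) +
  geom_histogram(binwidth = 0.5, colour = "black")

print(p)
```

Smaller widths show more detail but also more noise, while larger widths smooth the shape out, so it is worth experimenting before settling on one.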