Wednesday, July 29, 2020

Subjects and variables

Defining subjects and variables: Quantitative (discrete, continuous) vs Categorical (ordinal, nominal)


If you look at the mpg dataset, you'll notice a standard way of representing data in R and most standard statistical packages. Each row is a subject, and each column is a variable.

Table of MPG dataset

A subject is the smallest object or entity that you measure. In the mpg dataset, the subjects are types of car, and each row is a different type of car.
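The row-per-subject, column-per-variable layout can be sketched in a few lines. This is a Python illustration rather than R, and the three rows are typed in by hand to resemble mpg; treat the values as illustrative and check them against the real dataset.

```python
# Hand-typed illustrative rows shaped like the mpg dataset.
# They are an assumption for this sketch, not an authoritative copy.
rows = [
    {"manufacturer": "audi", "model": "a4", "displ": 1.8, "cyl": 4},
    {"manufacturer": "audi", "model": "a4", "displ": 2.0, "cyl": 4},
    {"manufacturer": "audi", "model": "a4", "displ": 2.8, "cyl": 6},
]

# Each row (one dict) is one subject; each key is one variable.
subjects = len(rows)
variables = list(rows[0].keys())

# Reading down one "column" collects a single variable across all subjects.
cyl_column = [row["cyl"] for row in rows]

print(subjects, variables, cyl_column)
```

Storing one dict per subject mirrors the table: adding a car adds a row, while measuring something new about every car adds a column.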

The things that you measure about each subject are called variables. For the first car, the manufacturer is Audi and the model is a4, so manufacturer and model are both variables. In statistics, variables are classified into four main types:

  • categorical ordinal,
  • categorical nominal,
  • quantitative continuous, and
  • quantitative discrete.

Categorical variables classify subjects using labels. Categorical ordinal variables have labels with a natural order, for example the bronze, silver and gold medals in the Olympics, while categorical nominal variables have labels with no order. In the mpg dataset, manufacturer is a categorical nominal variable, while model could arguably be treated as categorical ordinal, since models are often ordered according to price.
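The practical difference between ordinal and nominal labels can be shown with a small Python sketch. The medal ranks and the manufacturer list below are made up for illustration: ordinal labels can be sorted and compared by rank, whereas nominal labels can only be counted and tested for equality.

```python
from collections import Counter

# Ordinal: the labels carry a rank, so sorting by that rank is meaningful.
# MEDAL_RANK is an illustrative encoding of the Olympic medal order.
MEDAL_RANK = {"bronze": 1, "silver": 2, "gold": 3}
medals = ["gold", "bronze", "silver", "bronze"]
ordered = sorted(medals, key=MEDAL_RANK.get)
best = max(medals, key=MEDAL_RANK.get)

# Nominal: there is no rank, so the sensible operations are
# counting labels and checking whether two labels are the same.
manufacturers = ["audi", "ford", "audi", "toyota"]  # hypothetical sample
counts = Counter(manufacturers)

print(ordered, best, counts["audi"])
```

Sorting the manufacturers alphabetically would still run, but the resulting order would say nothing about the cars themselves, which is what makes the variable nominal.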

Quantitative variables are things that are measured using numbers. Quantitative continuous variables can take any numerical value, including fractions and decimals. Examples include time, temperature, length and weight, and anything that is derived from them. Quantitative discrete variables, on the other hand, are things that are counted, so they take only whole-number values. For example, you may count the number of people infected with a disease, or the number of cars that cross a bridge.

The main difference is that a sufficiently accurate measuring device can record a quantitative continuous variable at any value within a range, while a discrete variable will always have gaps between its possible values. For example, you can’t have 3.5 people infected with a disease.

In the mpg dataset, the variable cyl stands for the number of cylinders in the car. As cylinders are counted (rather than measured continuously), this variable is a quantitative discrete variable. The variable displ stands for an engine's displacement, which is a volume (in litres). Because it is measured rather than counted, it is a quantitative continuous variable.

Notice that because displacement is recorded to the nearest 0.1 of a litre, this variable ends up being discretised. However, it is still considered to be a continuous variable, as theoretically, volume could be recorded to any number of decimal places. On the other hand, you can never have half a cylinder, so the cyl variable will always be quantitative discrete.
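The distinction between a discretised continuous variable and a truly discrete one can be sketched in Python. The "true" volumes and both helper functions below are hypothetical: rounding only limits how finely a volume is recorded, whereas a count has no meaningful in-between values at all.

```python
def record_displacement(true_litres: float) -> float:
    """Round a measured volume to the nearest 0.1 litre (recording precision)."""
    return round(true_litres, 1)


def is_valid_count(x: float) -> bool:
    """A count such as number of cylinders must be a non-negative whole number."""
    return x >= 0 and float(x).is_integer()


# Hypothetical underlying volumes, invented for this sketch.
true_volumes = [1.781, 1.984, 2.771]

# Recorded values look discrete, but only because of the chosen precision.
recorded = [record_displacement(v) for v in true_volumes]

# Recording the same volumes more finely is still meaningful.
finer = [round(v, 2) for v in true_volumes]

print(recorded, finer, is_valid_count(4), is_valid_count(3.5))
```

Changing the rounding precision changes the recorded displacements but not what is being measured, while no change of precision makes 3.5 cylinders a valid value.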

So why do we care about this? Because if we know the type of variable, then we know how to deal with it statistically.
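As a rough illustration of how the variable type guides the analysis, the Python sketch below picks a summary based on the classifications given above. The `summarise` helper and the sample values are assumptions made for this example, and the choice of summaries (most frequent label for categorical, mean for quantitative) is one common convention rather than the only option.

```python
from statistics import mean, mode

# Classifications taken from the discussion of the mpg variables above.
VARIABLE_TYPES = {
    "manufacturer": "categorical nominal",
    "cyl": "quantitative discrete",
    "displ": "quantitative continuous",
}


def summarise(name, values):
    """Pick a summary that suits the variable's type (hypothetical helper)."""
    kind = VARIABLE_TYPES[name]
    if kind.startswith("categorical"):
        # Labels cannot be averaged, so report the most frequent one.
        return mode(values)
    # Quantitative values can be averaged.
    return mean(values)


# Sample values invented for illustration.
top_manufacturer = summarise("manufacturer", ["audi", "audi", "ford"])
average_cyl = summarise("cyl", [4, 4, 6, 8])

print(top_manufacturer, average_cyl)
```

Asking for the mean of the manufacturer labels would make no sense, which is the kind of mistake that knowing the variable type helps to avoid.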


