Data Cleaning, Normalization, and Enhancement
Data cleaning, normalization, and enhancement techniques aim to improve the quality of data sets. Quality can be measured along several dimensions; we define each of them below, referring to the concepts introduced in the previous sections.
- Validity refers to whether the values in the data set are of acceptable data types (e.g., integer, fractional number, or text), fall within acceptable ranges (e.g., between 0 and 100), are from an approved list of options (e.g., "Approved" or "Rejected"), are non-empty, and so on.
- Consistency refers to whether there are contradictory entries within a single data set or across data sets (e.g., if the same customer identifier is associated with different values in the address column).
- Uniformity refers to whether the values found in records represent measurements in the same units (within the data set or across data sets).
- Accuracy refers to how well the values in each record represent the properties of the real-world object to which the record corresponds. In general, improving accuracy requires some external reference against which the data can be compared.
- Completeness refers to whether there are any missing values in the records. Missing data is very difficult to replace without going back and collecting it again; however, it is possible to introduce new values (such as "Unknown") as placeholders that reflect the fact that information is missing.
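To make these dimensions concrete, the sketch below checks a few of them on a small set of records. The field names (`customer_id`, `status`, `score`, `address`) and the sample values are hypothetical, chosen only to mirror the examples in the definitions above (an approved list of options, a 0-to-100 range, a customer identifier with conflicting addresses, and an "Unknown" placeholder for missing values).

```python
# Hypothetical records; field names and values are illustrative only.
records = [
    {"customer_id": "C1", "status": "Approved", "score": 87,   "address": "12 Oak St"},
    {"customer_id": "C2", "status": "rejected", "score": 150,  "address": "5 Elm Ave"},
    {"customer_id": "C1", "status": "Approved", "score": 55,   "address": "99 Pine Rd"},
    {"customer_id": "C3", "status": "Approved", "score": None, "address": "7 Birch Ln"},
]

VALID_STATUSES = {"Approved", "Rejected"}  # the approved list of options

def validity_issues(record):
    """Validity: check the approved list and the acceptable 0-100 range."""
    issues = []
    if record["status"] not in VALID_STATUSES:
        issues.append("status not in approved list")
    score = record["score"]
    if score is not None and not (isinstance(score, int) and 0 <= score <= 100):
        issues.append("score outside 0-100")
    return issues

def consistency_issues(records):
    """Consistency: flag a customer identifier tied to different addresses."""
    seen = {}
    issues = []
    for r in records:
        cid = r["customer_id"]
        if cid in seen and seen[cid] != r["address"]:
            issues.append(f"{cid}: conflicting addresses")
        seen.setdefault(cid, r["address"])
    return issues

def fill_missing(record, placeholder="Unknown"):
    """Completeness: mark missing values with an explicit placeholder."""
    return {k: (placeholder if v is None else v) for k, v in record.items()}
```

Running the checks on the sample records flags the lowercase `"rejected"` and the out-of-range score as validity issues, the two addresses for `C1` as a consistency issue, and replaces the missing score for `C3` with `"Unknown"`. Note that accuracy is absent here on purpose: as stated above, assessing it requires an external reference to compare against, which a self-contained check cannot supply.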