Data Cleaning, Normalization, and Enhancement
Data cleaning, normalization, and enhancement techniques aim to improve the quality of data sets. Quality can be measured along several dimensions; we define each of them below, referring to the concepts introduced in the previous sections.
- Validity refers to whether the values in the data set are of acceptable data types (e.g., integer, fractional number, or text), fall within acceptable ranges (e.g., between 0 and 100), are from an approved list of options (e.g., "Approved" or "Rejected"), are non-empty, and so on.
- Consistency refers to whether there are contradictory entries within a single data set or across data sets (e.g., if the same customer identifier is associated with different values in the address column).
- Uniformity refers to whether the values found in records represent measurements in the same units (within the data set or across data sets).
- Accuracy refers to how well the values in each record represent the properties of the real-world object to which the record corresponds. In general, improving accuracy requires some external reference against which the data can be compared.
- Completeness refers to whether there are any missing values in the records. Missing data is very difficult to replace without going back and collecting it again; however, it is possible to introduce new values (such as "Unknown") as placeholders that reflect the fact that information is missing.
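To make these dimensions concrete, the sketch below checks a few of them on a small set of records. The field names (`customer_id`, `status`, `score`, `address`) and the sample values are hypothetical, chosen only to mirror the examples in the definitions above (an approved list of options, a 0-to-100 range, a customer identifier with conflicting addresses, and an "Unknown" placeholder for missing values).

```python
# Hypothetical records; field names and values are illustrative only.
records = [
    {"customer_id": "C1", "status": "Approved", "score": 87,   "address": "12 Oak St"},
    {"customer_id": "C2", "status": "rejected", "score": 150,  "address": "5 Elm Ave"},
    {"customer_id": "C1", "status": "Approved", "score": 55,   "address": "99 Pine Rd"},
    {"customer_id": "C3", "status": "Approved", "score": None, "address": "7 Birch Ln"},
]

VALID_STATUSES = {"Approved", "Rejected"}  # the approved list of options

def validity_issues(record):
    """Validity: check the approved list and the acceptable 0-100 range."""
    issues = []
    if record["status"] not in VALID_STATUSES:
        issues.append("status not in approved list")
    score = record["score"]
    if score is not None and not (isinstance(score, int) and 0 <= score <= 100):
        issues.append("score outside 0-100")
    return issues

def consistency_issues(records):
    """Consistency: flag a customer identifier tied to different addresses."""
    seen = {}
    issues = []
    for r in records:
        cid = r["customer_id"]
        if cid in seen and seen[cid] != r["address"]:
            issues.append(f"{cid}: conflicting addresses")
        seen.setdefault(cid, r["address"])
    return issues

def fill_missing(record, placeholder="Unknown"):
    """Completeness: mark missing values with an explicit placeholder."""
    return {k: (placeholder if v is None else v) for k, v in record.items()}
```

Running the checks on the sample records flags the lowercase `"rejected"` and the out-of-range score as validity issues, the two addresses for `C1` as a consistency issue, and replaces the missing score for `C3` with `"Unknown"`. Note that accuracy is absent here on purpose: as stated above, assessing it requires an external reference to compare against, which a self-contained check cannot supply.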