
Sunday, December 20, 2020

Data Cleaning, Normalization, and Enhancement


Data cleaning, normalization, and enhancement techniques aim to improve the quality of data sets. Data quality can be measured along a number of dimensions; we define each of them below, referring to concepts introduced in the previous sections.

  • Validity refers to whether the values in the data set are of acceptable data types (e.g., integer, fractional number, or text), fall within acceptable ranges (e.g., between 0 and 100), are from an approved list of options (e.g., "Approved" or "Rejected"), are non-empty, and so on.
  • Consistency refers to whether there are contradictory entries within a single data set or across data sets (e.g., if the same customer identifier is associated with different values in an address column).
  • Uniformity refers to whether the values found in records represent measurements in the same units (within the data set or across data sets).
  • Accuracy refers to how well the values in each record represent the properties of the real-world object to which the record corresponds. In general, improving accuracy requires some external reference against which the data can be compared.
  • Completeness refers to whether there are any missing values in the records. Missing data is very difficult to replace without going back and collecting it again; however, it is possible to introduce new values (such as "Unknown") as placeholders that reflect the fact that information is missing.
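As a concrete (and heavily simplified) illustration, the Python sketch below uses pandas to check a few of these dimensions on a made-up table; the column names (customer_id, status, score) and the 0-100 score range are hypothetical assumptions, not part of the definitions above.

import pandas as pd

# A small, made-up data set with a few deliberate quality problems.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "status": ["Approved", "Rejected", "approved", None],
    "score": [87, 105, 42, 63],  # scores are expected to fall between 0 and 100
})

# Validity: flag scores that fall outside the acceptable 0-100 range.
invalid_scores = df[(df["score"] < 0) | (df["score"] > 100)]

# Validity: flag statuses that are not on the approved list of options.
invalid_status = df[~df["status"].isin(["Approved", "Rejected"])]

# Completeness: count missing values per column.
missing_counts = df.isna().sum()

# Completeness: introduce an explicit placeholder value for missing statuses.
df["status"] = df["status"].fillna("Unknown")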


Common Data Transformations

Forms of common data transformation

The basic data transformation types are enumerated and discussed in detail below:

Union transformations take two data sets as their input and produce an output data set that contains all entries found in either input data set. The output data set must have at least as many records as either of the two input data sets.

Intersection transformations take two data sets as their input and produce an output data set that contains only those entries found in both input data sets. The output data set can have at most as many records as the smaller of the two input data sets.

Difference transformations take two data sets as their input and produce a data set that contains only those records found in the first data set but not the second data set. The output data set must have at most as many records as the number of records in the first input data set.
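A minimal sketch of these three set-style transformations in pandas, assuming two small data sets with identical columns (the customer_id and region columns are hypothetical):

import pandas as pd

a = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["East", "West", "East"]})
b = pd.DataFrame({"customer_id": [2, 3, 4], "region": ["West", "East", "North"]})

# Union: every record found in either input data set (duplicates removed).
union = pd.concat([a, b]).drop_duplicates()

# Intersection: only those records present in both input data sets.
intersection = a.merge(b, how="inner")

# Difference: records present in the first data set but not in the second.
merged = a.merge(b, how="left", indicator=True)
difference = merged[merged["_merge"] == "left_only"].drop(columns="_merge")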

Selection transformations involve extracting some portion of the data based on zero or more filtering conditions or criteria. A selection transformation might return the entire original data set (e.g., if the criteria are already satisfied by all the records in the input data set), but it cannot return a result that is larger than the original data set.

A filtering condition within a selection transformation usually consists of a logical expression that is either true or false for each record. The condition can reference the values found in each record using their corresponding attribute/column names; it can also contain arithmetic operators (addition, subtraction, multiplication, division, and so on), relational operators (equality, comparison, and so on), and logical operations (such as "and" and "or"). In some database management systems, more complex conditions can be defined (e.g., ones that do text search).
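For instance, a selection whose filtering condition combines a relational comparison with a logical "and" might look like the following sketch (the orders table and its columns are hypothetical):

import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [250.0, 80.0, 400.0, 120.0],
    "status": ["Approved", "Rejected", "Approved", "Approved"],
})

# Selection: keep only approved orders worth more than 100.
large_approved = orders[(orders["status"] == "Approved") & (orders["amount"] > 100)]

# The same condition expressed as a query string.
large_approved_alt = orders.query("status == 'Approved' and amount > 100")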

Projection transformations involve converting every record in a data set in some way to produce a new data set. A projection transformation always produces the same number of records in the output data set as there were in the input data set. The conversion itself might throw away some attributes/columns or might introduce new ones. The definition of a projection transformation can use arithmetic and other operations to transform the values inside the input data set's records into the values within the records of the output data set.

Renaming transformations simply rename one or more of the attributes/columns. They are usually combined with projection and selection transformations so that the output data sets can have informative attribute/column names.
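The following sketch shows a projection (which derives a new column and keeps one output record per input record) followed by a renaming; the sales table and its columns are again hypothetical:

import pandas as pd

sales = pd.DataFrame({
    "order_id": [1, 2, 3],
    "unit_price": [10.0, 25.0, 7.5],
    "quantity": [3, 2, 10],
})

# Projection: keep a subset of columns and derive a new one; the output has
# exactly as many records as the input.
projected = sales[["order_id"]].assign(total=sales["unit_price"] * sales["quantity"])

# Renaming: give the derived column a more informative name.
renamed = projected.rename(columns={"total": "total_price"})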

The advanced data transformation types are enumerated and discussed in detail below for your reference and review.

Aggregation transformations involve combining the values within a particular attribute/column across all records in a data set. Examples of tasks for which this may be useful include counting the number of records in a data set, taking the sum of all values in a column, finding the maximum value across all values in a column, and so on. In its basic form, an aggregation transformation produces a data set with exactly one record.

In some languages and database management systems, it is possible to group the records using a particular attribute (which we call the grouping attribute) when performing an aggregation. The aggregation operation is then applied separately to each collection of records that shares the same value of the grouping attribute, and the number of records in the output data set corresponds to the number of unique values found in the grouping attribute/column.
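A short sketch of both forms of aggregation, using a hypothetical region column as the grouping attribute:

import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "West", "East"],
    "amount": [100.0, 80.0, 55.0, 120.0, 30.0],
})

# Basic aggregation: collapse the whole data set into a single record.
overall = sales["amount"].agg(["count", "sum", "max"])

# Grouped aggregation: one output record per unique value of the grouping attribute.
by_region = sales.groupby("region")["amount"].agg(["count", "sum", "max"])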

Join transformations take two input data sets and return their Cartesian product. Thus, the number of entries in the output data set may be larger (even significantly larger) than the number of entries in each of the two input data sets. It is common to combine join transformations with selection transformations in order to pair corresponding records using their identifiers (or other attributes) even if the records are found in different data sets. One example of this might be matching all purchase records in a purchases data set with all customer records in a customers data set.
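The sketch below pairs hypothetical purchases and customers data sets, first as an explicit Cartesian product followed by a selection on matching identifiers, and then as the equivalent keyed join that most systems provide directly:

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Grace"]})
purchases = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [30.0, 12.5, 99.0]})

# Cartesian product: every purchase paired with every customer
# (cross joins require pandas 1.2 or later).
product = purchases.merge(customers, how="cross", suffixes=("_p", "_c"))

# Selection on the product: keep only pairs whose customer identifiers match.
matched = product[product["customer_id_p"] == product["customer_id_c"]]

# In practice, both steps are usually combined into a single keyed join.
joined = purchases.merge(customers, on="customer_id")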

 

Sources of Data Available Inside and Outside the Organization, and Data Source Terminology

Internal and External data sources available in a company

Potential sources of data vary across scenarios, and it is easy to miss an opportunity. One way to alleviate this is to keep a comprehensive taxonomy of common data sources in mind.

Internal Data Sources Available to an Organization

Internal data sets and data sources are those that can be derived entirely from data or activities that already exist within the organization. Breakdowns of the different categories of potential data sources within an organization are reproduced here for your review and reference.

Existing data sets already being generated and/or stored in digital form can include the following.

  • Structured data (e.g., personnel or accounting records, sales and transactions)
  • Semi-structured or unstructured data (e.g., a data warehouse, or social media posts made by the organization)
  • Metadata of existing data sets

Definition: The term metadata typically refers to information about a data set or individual entries in that data set. Most data sets have at least some metadata associated with them. Examples include the date and time of creation, how the data is structured or organized, or permissions that determine who can access or modify it.

Assets and business activities within the organization that can potentially be surveyed, measured, and/or tracked to generate new data sets include those enumerated below.

  • Tracking information and measurements
    • Existing assets (e.g., current inventory of manufactured goods)
    • Internal events (e.g., sales figures for products)
    • Interactions with other organizations (e.g., subcontractors or partner organizations)
    • External opportunities
  • Exploratory or diagnostic experiments conducted within the organization
  • Crowd-sourced data

External Data Sources Available to an Organization

Breakdowns of the different categories of potential external data sources are reproduced here for your review and reference.
  • Acquired or purchased data sets
    • Data sets provided in structured form by commercial organizations (e.g., Nielsen Holdings) that may be relevant to the business question
    • Data streams of structured information to which it may be possible to subscribe for a fee (e.g., website traffic analysis services such as Google Analytics)
  • Data provided by customers
    • Social networking and social media services often provide APIs that can be used to collect information posted by customers (both on their own accounts and on the organization’s accounts)
    • Direct communications from customers, including email, can be a rich data source
  • Free, open, or publicly accessible data sources
    • Some private organizations provide data sets via online portals (though it is important to check any restrictions on the use of that data in the license that accompanies it)
    • Many data sets are provided by governments, government agencies (such as the US Census Bureau) and non-profit organizations via open data portals
  • Other data published or publicly accessible (e.g., online) in unstructured form, as long as its use does not violate the applicable terms and licenses
    • Data that can be manually collected and curated into a structured form
    • Data published online that can be automatically parsed and collected via web scraping (usually a workflow that a data engineer is best-suited to implement)

 

Once we have chosen the assets and activities of interest within the organization that can act as data sources, we need to identify the means available to the organization to collect and possibly store the desired data. The question of what resources are required to collect and store the new or existing data is driven in part by the characteristics of the business question being addressed. For example, is a one-time decision being made, or is a new and ongoing process being introduced within the organization? Will a new unit within the organization be responsible for acting on these data sources? We introduce several characterizations of data sources that can help navigate these issues.


Definition: A static or one-time data source or data set consists of a fixed quantity of data that can be retrieved once. Such a data set may have been collected via a one-time survey or study. It may also have been obtained via the commissioning of an outside consulting firm or through a one-time purchase from a vendor.

Definition: A transactional data source or data set typically consists of entries that describe events (i.e., changes that occur as a result of a transaction) that have a specified time and may refer to one or more reference objects. These events normally correspond to verbs (e.g., ordered, shipped, paid).

Typical categories of transactional data include financial (e.g., invoices and payments), operational (e.g., tasks assigned or completed), and logistical (e.g., orders and deliveries).
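As a small illustration, transactional entries might be represented as records like the following (all values are hypothetical):

from datetime import datetime

# Each entry describes an event (a verb), when it occurred, and the
# reference objects (invoice, order, customer) it refers to.
transactions = [
    {"event": "invoice_issued",   "time": datetime(2020, 12, 1, 9, 30),  "invoice_id": "INV-001", "customer_id": 42},
    {"event": "payment_received", "time": datetime(2020, 12, 5, 14, 10), "invoice_id": "INV-001", "customer_id": 42},
    {"event": "order_shipped",    "time": datetime(2020, 12, 6, 8, 0),   "order_id": "ORD-377",   "customer_id": 42},
]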

Definition: A real-time or streaming source of data (also known as a data feed) is one from which data is being generated continuously at some (possibly high) rate.

In some cases, the streaming data may be delivered directly to the organization, in which case the organization must maintain an automated infrastructure that can determine where to store this data by provisioning internal (or cloud-based) storage resources as appropriate. In other cases, an organization may have the option to sample a data stream as necessary.

Definition: A data warehouse is a system used for retrieving, integrating, and storing data from multiple sources so that it may be used for reporting and analysis.

A data warehouse normally integrates data from a number of sources (including static, transactional, and streaming data), and some amount of quality control or cleansing may be performed on this data before it is used.

Definition: The provenance of a data set or stream (or an item therein) is a record of its origin and, possibly, its lifespan. This includes to whom the data set can be attributed, at what location and time it was created, from what other data sets it was derived, and by what process this derivation was accomplished.

Data provenance can be tracked at a coarse granularity (i.e., for entire data sets) or at a fine granularity (i.e., for every individual entry within the data set). The provenance information associated with a data set, or an entry within a data set, could constitute a part of its metadata.

The World Wide Web Consortium’s PROV standard lays out a well-defined format for provenance documents (which has been implemented in many machine-readable digital representations and for which many software tools exist that allow users to edit, combine, and visualize provenance documents). Each document can be a record of the provenance of one or more data sets, and is broken down into entities (e.g., data sets, individual data entries, reports, and so on), agents (e.g., data analysts, or automated algorithms making decisions according to a schedule and/or some set of rules), and activities (e.g., events that generate data entries or data sets, data analyses executed using certain tools, data-driven operational decisions made by the organization, and so on).
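As a rough, non-normative illustration of this breakdown (a plain Python structure, not one of the official PROV serializations), a coarse-grained provenance record for a derived data set might look like this; all names and timestamps are hypothetical:

# Simplified provenance record: which entity was derived from what,
# by which activity, and to which agent it is attributed.
provenance = {
    "entity": "quarterly_sales_summary.csv",
    "derived_from": ["sales_transactions_2020Q4.csv"],
    "generated_by": {
        "activity": "aggregate-sales-by-region",
        "started_at": "2020-12-20T10:00:00Z",
        "ended_at": "2020-12-20T10:05:00Z",
    },
    "attributed_to": "data-analyst@example.org",
}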

 

 

Saturday, December 19, 2020

Descriptive, Predictive, and Prescriptive analytics classification questions


Please read about descriptive, predictive, and prescriptive analytics here: https://triksbuddy.blogspot.com/2020/12/what-are-different-types-of-analytics.html

Then try to answer the following questions and check your answers against the answer key below:

1. A classification of our customers into four quantiles according to their profitability in the last four quarters. 

A. Descriptive
B. Predictive
C. Prescriptive

 

2.  A classification of our customers into four quantiles according to their expected profitability in the following four quarters.

A. Descriptive
B. Predictive
C. Prescriptive

 

3. A model that assigns a credit limit to each customer such that it optimizes our bank’s expected profits in the next four quarters.

A. Descriptive
B. Predictive
C. Prescriptive

 

4. A list of our best 10 customers on the basis of their sales growth in the last quarter.

A. Descriptive
B. Predictive
C. Prescriptive

 

5. A list of the 10 customers that are most likely to leave our company in the next two quarters.

A. Descriptive
B. Predictive
C. Prescriptive

 

6. A model that assigns to each credit card transaction a score that represents its probability of being a fraudulent transaction.

A. Descriptive
B. Predictive
C. Prescriptive

 

7. A model that outputs a preventive maintenance schedule of airplane engines such that it minimizes our airline’s annual maintenance and downtime expenditure.

A. Descriptive
B. Predictive
C. Prescriptive

 

8. A model that schedules the timing of the posting of an individual’s tweets so as to maximize the daily number of associated retweets.

A. Descriptive
B. Predictive
C. Prescriptive

 

9. A list of students that are at high risk of dropping out of our university in the next two semesters.

A. Descriptive
B. Predictive
C. Prescriptive

 

10. A model that suggests an individualized student degree completion path such that it minimizes the likelihood that the student will quit her studies before completing her degree.

A. Descriptive
B. Predictive
C. Prescriptive

 

 

Answers:

1. A
2. B
3. C
4. A
5. B
6. B
7. C
8. C
9. B
10. C

 

What are different types of analytics? Describe different types of analytics.


Analytics: 

The term analytics is used to characterize a vast array of methods that use data to help make better business decisions, and there are many ways to organize them into subcategories. The simplest is to divide analytics into three large classes:

  • Descriptive Analytics
  • Predictive Analytics, and 
  • Prescriptive Analytics


Descriptive Analytics: The most basic form of analytics is descriptive analytics. The simplest way to define descriptive analytics is that it answers the question "What has happened?" Descriptive analytics typically condenses large amounts of historical or real-time data into smaller, more meaningful nuggets of information.

For example, in an internet marketing context, descriptive analytics could be used to summarize a large number of search, display and social media advertising campaigns into a smaller set of metrics that shows the average click-through rate, conversion rate, and the return on investment of each of these three advertising channels.
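A minimal sketch of this kind of summarization, assuming a hypothetical campaigns table with per-campaign impressions, clicks, conversions, spend, and revenue:

import pandas as pd

campaigns = pd.DataFrame({
    "channel":     ["search", "search", "display", "display", "social"],
    "impressions": [10000, 8000, 50000, 40000, 20000],
    "clicks":      [500, 320, 600, 450, 700],
    "conversions": [50, 30, 24, 18, 35],
    "spend":       [1000.0, 800.0, 900.0, 700.0, 600.0],
    "revenue":     [2500.0, 1500.0, 1200.0, 900.0, 1400.0],
})

# Descriptive analytics: condense many campaigns into a few per-channel metrics.
by_channel = campaigns.groupby("channel").sum()
summary = pd.DataFrame({
    "click_through_rate":   by_channel["clicks"] / by_channel["impressions"],
    "conversion_rate":      by_channel["conversions"] / by_channel["clicks"],
    "return_on_investment": (by_channel["revenue"] - by_channel["spend"]) / by_channel["spend"],
})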



The main objective of descriptive analytics is to find out the reasons behind success or failure in the past. The vast majority of big data analytics used by organizations falls into the category of descriptive analytics.


Predictive Analytics: The next class of analytics, predictive analytics, uses data from the past to predict what will happen in the future.

For example, suppose we would like to predict the likelihood that a new prospective customer will respond to a promotional email campaign. By analyzing past data on prospects who did and did not respond to similar campaigns, analytics can help identify what distinguishes those who responded from those who did not. On the basis of this data, a model can be built to assess the probability that a new prospect will respond to a future campaign.
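A sketch of what such a model might look like using scikit-learn's logistic regression; the features (past_purchases, days_since_signup), the responded label, and all of the numbers are invented for illustration:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Historical prospects: features plus whether they responded to a past campaign.
history = pd.DataFrame({
    "past_purchases":    [0, 3, 1, 5, 0, 2, 4, 0],
    "days_since_signup": [400, 30, 200, 15, 365, 90, 20, 500],
    "responded":         [0, 1, 0, 1, 0, 1, 1, 0],
})

model = LogisticRegression()
model.fit(history[["past_purchases", "days_since_signup"]], history["responded"])

# Estimated probability that a new prospect will respond to a future campaign.
new_prospect = pd.DataFrame({"past_purchases": [2], "days_since_signup": [60]})
response_probability = model.predict_proba(new_prospect)[0, 1]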


As we can see, predictive analytics is built on a solid understanding of what happened in the past, so most organizations deploy it after they have mastered the art and science of descriptive analytics.


Prescriptive Analytics: Armed with models of the past and forecasts of the future, organizations can then venture to the more advanced level of analytics, prescriptive analytics.


This class of methods uses optimization algorithms to determine the set of actions that optimizes a desirable objective, given predictions of what is likely to happen.

Referring to the previous example, and assuming that we have built predictive models for a number of different campaigns, a prescriptive analytics model could be used to determine which campaign should be sent to which prospects in order to maximize our expected sales while staying within our marketing budget.
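A deliberately simplified sketch of the idea: given predicted expected sales for each (prospect, campaign) pair and a fixed budget, a greedy heuristic (standing in for a real optimization algorithm) chooses which campaign to send to whom. All names and numbers are hypothetical:

# Predicted expected sales and contact costs for each (prospect, campaign) pair.
options = [
    {"prospect": "p1", "campaign": "spring_promo",  "expected_sales": 120.0, "cost": 10.0},
    {"prospect": "p1", "campaign": "loyalty_offer", "expected_sales": 90.0,  "cost": 5.0},
    {"prospect": "p2", "campaign": "spring_promo",  "expected_sales": 60.0,  "cost": 10.0},
    {"prospect": "p3", "campaign": "loyalty_offer", "expected_sales": 40.0,  "cost": 5.0},
]
budget = 15.0

# Greedy prescriptive step: at most one campaign per prospect, preferring the
# best expected sales per unit cost, without exceeding the budget.
chosen, spent, assigned = [], 0.0, set()
for opt in sorted(options, key=lambda o: o["expected_sales"] / o["cost"], reverse=True):
    if opt["prospect"] not in assigned and spent + opt["cost"] <= budget:
        chosen.append(opt)
        spent += opt["cost"]
        assigned.add(opt["prospect"])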


As organizations gain experience and skills with data-driven business decision-making, they typically progress from descriptive, to predictive, and finally to using prescriptive analytics to inform decisions and actions in a growing number of functions.