
Sunday, December 20, 2020

Data Cleaning, Normalization, and Enhancement


Data cleaning, normalization, and enhancement techniques aim to improve the quality of data sets. Data quality can be measured along a number of dimensions; we define each of them below, referring to concepts introduced in the previous sections.

  • Validity refers to whether the values in the data set are of acceptable data types (e.g., integer, fractional number, or text), fall within acceptable ranges (e.g., between 0 and 100), are from an approved list of options (e.g., "Approved" or "Rejected"), are non-empty, and so on.
  • Consistency refers to whether there are contradictory entries within a single data set or across data sets (e.g., if the same customer identifier is associated with different values in an address column).
  • Uniformity refers to whether the values found in records represent measurements in the same units (within the data set or across data sets).
  • Accuracy refers to how well the values in each record represent the properties of the real-world object to which the record corresponds. In general, improving accuracy requires some external reference against which the data can be compared.
  • Completeness refers to whether there are any missing values in the records. Missing data is very difficult to replace without going back and collecting it again; however, it is possible to introduce new values (such as "Unknown") as placeholders that reflect the fact that information is missing.
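As a concrete (and heavily simplified) illustration, the Python sketch below uses pandas to check a few of these dimensions on a made-up table; the column names (customer_id, status, score) and the 0-100 score range are hypothetical assumptions, not part of the definitions above.

import pandas as pd

# A small, made-up data set with a few deliberate quality problems.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "status": ["Approved", "Rejected", "approved", None],
    "score": [87, 105, 42, 63],  # scores are expected to fall between 0 and 100
})

# Validity: flag scores that fall outside the acceptable 0-100 range.
invalid_scores = df[(df["score"] < 0) | (df["score"] > 100)]

# Validity: flag statuses that are not on the approved list of options.
invalid_status = df[~df["status"].isin(["Approved", "Rejected"])]

# Completeness: count missing values per column.
missing_counts = df.isna().sum()

# Completeness: introduce an explicit placeholder value for missing statuses.
df["status"] = df["status"].fillna("Unknown")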


Common Data Transformations

Forms of common data transformation

The basic data transformation types are enumerated and discussed in detail below:

Union transformations take two data sets as their input and produce an output data set that contains all entries found in either input data set. The output data set must have at least as many records as either of the two input data sets.

Intersection transformations take two data sets as their input and produce an output data set that contains only those entries found in both input data sets. The output data set can have at most as many records as the smaller of the two input data sets.

Difference transformations take two data sets as their input and produce a data set that contains only those records found in the first data set but not the second data set. The output data set must have at most as many records as the number of records in the first input data set.
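A minimal sketch of these three set-style transformations in pandas, assuming two small data sets with identical columns (the customer_id and region columns are hypothetical):

import pandas as pd

a = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["East", "West", "East"]})
b = pd.DataFrame({"customer_id": [2, 3, 4], "region": ["West", "East", "North"]})

# Union: every record found in either input data set (duplicates removed).
union = pd.concat([a, b]).drop_duplicates()

# Intersection: only those records present in both input data sets.
intersection = a.merge(b, how="inner")

# Difference: records present in the first data set but not in the second.
merged = a.merge(b, how="left", indicator=True)
difference = merged[merged["_merge"] == "left_only"].drop(columns="_merge")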

Selection transformations involve extracting some portion of the data based on zero or more filtering conditions or criteria. A selection transformation might return the entire original data set (e.g., if the criteria are already satisfied by all the records in the input data set), but it cannot return a result that is larger than the original data set.

A filtering condition within a selection transformation usually consists of a logical expression that is either true or false for each record. The condition can reference the values found in each record using their corresponding attribute/column names; it can also contain arithmetic operators (addition, subtraction, multiplication, division, and so on), relational operators (equality, comparison, and so on), and logical operations (such as "and" and "or"). In some database management systems, more complex conditions can be defined (e.g., ones that do text search).
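For instance, a selection whose filtering condition combines a relational comparison with a logical "and" might look like the following sketch (the orders table and its columns are hypothetical):

import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [250.0, 80.0, 400.0, 120.0],
    "status": ["Approved", "Rejected", "Approved", "Approved"],
})

# Selection: keep only approved orders worth more than 100.
large_approved = orders[(orders["status"] == "Approved") & (orders["amount"] > 100)]

# The same condition expressed as a query string.
large_approved_alt = orders.query("status == 'Approved' and amount > 100")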

Projection transformations involve converting every record in a data set in some way to produce a new data set. A projection transformation always produces the same number of records in the output data set as there were in the input data set. The conversion itself might throw away some attributes/columns or might introduce new ones. The definition of a projection transformation can use arithmetic and other operations to transform the values inside the input data set's records into the values within the records of the output data set.

Renaming transformations simply rename one or more of the attributes/columns. They are usually combined with projection and selection transformations so that the output data sets can have informative attribute/column names.
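The following sketch shows a projection (which derives a new column and keeps one output record per input record) followed by a renaming; the sales table and its columns are again hypothetical:

import pandas as pd

sales = pd.DataFrame({
    "order_id": [1, 2, 3],
    "unit_price": [10.0, 25.0, 7.5],
    "quantity": [3, 2, 10],
})

# Projection: keep a subset of columns and derive a new one; the output has
# exactly as many records as the input.
projected = sales[["order_id"]].assign(total=sales["unit_price"] * sales["quantity"])

# Renaming: give the derived column a more informative name.
renamed = projected.rename(columns={"total": "total_price"})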

The advanced data transformation types are enumerated and discussed in detail below for your reference and review.

Aggregation transformations involve combining the values within a particular attribute/column across all records in a data set. Examples of tasks for which this may be useful include counting the number of records in a data set, taking the sum of all values in a column, finding the maximum value across all values in a column, and so on. In its basic form, an aggregation transformation produces a data set with exactly one record.

In some languages and database management systems, it is possible to group the records using a particular attribute (which we call the grouping attribute) when performing an aggregation. The aggregation operation is then applied separately to each collection of records that shares the same value of the grouping attribute, and the number of records in the output data set corresponds to the number of unique values found in the grouping attribute/column.
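A short sketch of both forms of aggregation, using a hypothetical region column as the grouping attribute:

import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "West", "East"],
    "amount": [100.0, 80.0, 55.0, 120.0, 30.0],
})

# Basic aggregation: collapse the whole data set into a single record.
overall = sales["amount"].agg(["count", "sum", "max"])

# Grouped aggregation: one output record per unique value of the grouping attribute.
by_region = sales.groupby("region")["amount"].agg(["count", "sum", "max"])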

Join transformations take two input data sets and return their Cartesian product. Thus, the number of entries in the output data set may be larger (even significantly larger) than the number of entries in each of the two input data sets. It is common to combine join transformations with selection transformations in order to pair corresponding records using their identifiers (or other attributes) even if the records are found in different data sets. One example of this might be matching all purchase records in a purchases data set with all customer records in a customers data set.
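The sketch below pairs hypothetical purchases and customers data sets, first as an explicit Cartesian product followed by a selection on matching identifiers, and then as the equivalent keyed join that most systems provide directly:

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Grace"]})
purchases = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [30.0, 12.5, 99.0]})

# Cartesian product: every purchase paired with every customer
# (cross joins require pandas 1.2 or later).
product = purchases.merge(customers, how="cross", suffixes=("_p", "_c"))

# Selection on the product: keep only pairs whose customer identifiers match.
matched = product[product["customer_id_p"] == product["customer_id_c"]]

# In practice, both steps are usually combined into a single keyed join.
joined = purchases.merge(customers, on="customer_id")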

 

Sources of Data Available Inside and Outside the Organization, and Data Source Terminology

Internal and External data sources available in a company

Potential sources of data vary across scenarios, and it is easy to miss an opportunity. One way to alleviate this is to keep a comprehensive taxonomy of common data sources in mind.

Internal Data Sources Available to an Organization

Internal data sets and data sources are those that can be derived entirely from data or activities that already exist within the organization. Breakdowns of the different categories of potential data sources within an organization are reproduced here for your review and reference.

Existing data sets already being generated and/or stored in digital form can include the following.

  • Structured data (e.g., personnel or accounting records, sales and transactions)
  • Semi-structured or unstructured data (e.g., a data warehouse, or social media posts made by the organization)
  • Metadata of existing data sets

Definition: The term metadata typically refers to information about a data set or individual entries in that data set. Most data sets have at least some metadata associated with them. Examples include the date and time of creation, how the data is structured or organized, or permissions that determine who can access or modify it.

Assets and business activities within the organization that can potentially be surveyed, measured, and/or tracked to generate new data sets include those enumerated below.

  • Tracking information and measurements
    • Existing assets (e.g., current inventory of manufactured goods)
    • Internal events (e.g., sales figures for products)
    • Interactions with other organizations (e.g., subcontractors or partner organizations)
    • External opportunities
  • Exploratory or diagnostic experiments conducted within the organization
  • Crowd-sourced data

External Data Sources Available to an Organization

Breakdowns of the different categories of potential external data sources are reproduced here for your review and reference.
  • Acquired or purchased data sets
    • Data sets provided in structured form by commercial organizations (e.g., Nielsen Holdings) that may be relevant to the business question
    • Data streams of structured information to which it may be possible to subscribe for a fee (e.g., website traffic analysis services such as Google Analytics)
  • Data provided by customers
    • Social networking and social media services often provide APIs that can be used to collect information posted by customers (both on their own accounts and on the organization’s accounts)
    • Direct communications from customers, including email, can be a rich data source
  • Free, open, or publicly accessible data sources
    • Some private organizations provide data sets via online portals (though it is important to check any restrictions on the use of that data in the license that accompanies it)
    • Many data sets are provided by governments, government agencies (such as the US Census Bureau) and non-profit organizations via open data portals
  • Other data published or publicly accessible (e.g., online) in unstructured form, as long as its use does not violate the applicable terms and licenses
    • Data that can be manually collected and curated into a structured form
    • Data published online that can be automatically parsed and collected via web scraping (usually a workflow that a data engineer is best-suited to implement)

 

Once we have chosen the assets and activities of interest within the organization that can act as data sources, we need to identify the means available to the organization to collect and possibly store the desired data. The question of what resources are required to collect and store the new or existing data is driven in part by the characteristics of the business question being addressed. For example, is a one-time decision being made, or is a new and ongoing process being introduced within the organization? Will a new unit within the organization be responsible for acting on these data sources? We introduce several characterizations of data sources that can help navigate these issues.


Definition: A static or one-time data source or data set consists of a fixed quantity of data that can be retrieved once. Such a data set may have been collected via a one-time survey or study. It may also have been obtained via the commissioning of an outside consulting firm or through a one-time purchase from a vendor.

Definition: A transactional data source or data set typically consists of entries that describe events (i.e., changes that occur as a result of a transaction) that have a specified time and may refer to one or more reference objects. These events normally correspond to verbs (e.g., ordered, shipped, paid).

Typical categories of transactional data include financial (e.g., invoices and payments), operational (e.g., tasks assigned or completed), and logistical (e.g., orders and deliveries).
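As a small illustration, transactional entries might be represented as records like the following (all values are hypothetical):

from datetime import datetime

# Each entry describes an event (a verb), when it occurred, and the
# reference objects (invoice, order, customer) it refers to.
transactions = [
    {"event": "invoice_issued",   "time": datetime(2020, 12, 1, 9, 30),  "invoice_id": "INV-001", "customer_id": 42},
    {"event": "payment_received", "time": datetime(2020, 12, 5, 14, 10), "invoice_id": "INV-001", "customer_id": 42},
    {"event": "order_shipped",    "time": datetime(2020, 12, 6, 8, 0),   "order_id": "ORD-377",   "customer_id": 42},
]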

Definition: A real-time or streaming source of data (also known as a data feed) is one from which data is being generated continuously at some (possibly high) rate.

In some cases, the streaming data may be delivered directly to the organization, in which case the organization must maintain an automated infrastructure that can determine where to store this data by provisioning internal (or cloud-based) storage resources as appropriate. In other cases, an organization may have the option to sample a data stream as necessary.

Definition: A data warehouse is a system used for retrieving, integrating, and storing data from multiple sources so that it may be used for reporting and analysis.

A data warehouse normally integrates data from a number of sources (including static, transactional, and streaming data), and some amount of quality control or cleansing may be performed on this data before it is used.

Definition: The provenance of a data set or stream (or an item therein) is a record of its origin and, possibly, its lifespan. This includes to whom the data set can be attributed, at what location and time it was created, from what other data sets it was derived, and by what process this derivation was accomplished.

Data provenance can be tracked at a coarse granularity (i.e., for entire data sets) or at a fine granularity (i.e., for every individual entry within the data set). The provenance information associated with a data set, or an entry within a data set, could constitute a part of its metadata.

The World Wide Web Consortium’s PROV standard lays out a well-defined format for provenance documents (which has been implemented in many machine-readable digital representations and for which many software tools exist that allow users to edit, combine, and visualize provenance documents). Each document can be a record of the provenance of one or more data sets, and is broken down into entities (e.g., data sets, individual data entries, reports, and so on), agents (e.g., data analysts, or automated algorithms making decisions according to a schedule and/or some set of rules), and activities (e.g., events that generate data entries or data sets, data analyses executed using certain tools, data-driven operational decisions made by the organization, and so on).
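As a rough, non-normative illustration of this breakdown (a plain Python structure, not one of the official PROV serializations), a coarse-grained provenance record for a derived data set might look like this; all names and timestamps are hypothetical:

# Simplified provenance record: which entity was derived from what,
# by which activity, and to which agent it is attributed.
provenance = {
    "entity": "quarterly_sales_summary.csv",
    "derived_from": ["sales_transactions_2020Q4.csv"],
    "generated_by": {
        "activity": "aggregate-sales-by-region",
        "started_at": "2020-12-20T10:00:00Z",
        "ended_at": "2020-12-20T10:05:00Z",
    },
    "attributed_to": "data-analyst@example.org",
}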

 

 

Saturday, December 19, 2020

Descriptive, Predictive, and Prescriptive analytics classification questions


Please read about descriptive, predictive, and prescriptive analytics here: https://triksbuddy.blogspot.com/2020/12/what-are-different-types-of-analytics.html

Then try to answer the following questions and check your answers against the answer key below:

1. A classification of our customers into four quantiles according to their profitability in the last four quarters. 

A. Descriptive
B. Predictive
C. Prescriptive

 

2.  A classification of our customers into four quantiles according to their expected profitability in the following four quarters.

A. Descriptive
B. Predictive
C. Prescriptive

 

3. A model that assigns a credit limit to each customer such that it optimizes our bank’s expected profits in the next four quarters.

A. Descriptive
B. Predictive
C. Prescriptive

 

4. A list of our best 10 customers on the basis of their sales growth in the last quarter.

A. Descriptive
B. Predictive
C. Prescriptive

 

5. A list of the 10 customers that are most likely to leave our company in the next two quarters.

A. Descriptive
B. Predictive
C. Prescriptive

 

6. A model that assigns to each credit card transaction a score that represents its probability of being a fraudulent transaction.

A. Descriptive
B. Predictive
C. Prescriptive

 

7. A model that outputs a preventive maintenance schedule of airplane engines such that it minimizes our airline’s annual maintenance and downtime expenditure.

A. Descriptive
B. Predictive
C. Prescriptive

 

8. A model that schedules the timing of the posting of an individual’s tweets so as to maximize the daily number of associated retweets.

A. Descriptive
B. Predictive
C. Prescriptive

 

9. A list of students that are at high risk of dropping out of our university in the next two semesters.

A. Descriptive
B. Predictive
C. Prescriptive

 

10. A model that suggests an individualized student degree completion path such that it minimizes the likelihood that the student will quit her studies before completing her degree.

A. Descriptive
B. Predictive
C. Prescriptive

 

 

Answers:

1. A
2. B
3. C
4. A
5. B
6. B
7. C
8. C
9. B
10. C

 

What are different types of analytics? Describe different types of analytics.


Analytics: 

The term analytics is used to characterize a vast array of methods that use data to help make better business decisions, and there are many ways to organize them into subcategories. The simplest is to divide analytics into three large classes:

  • Descriptive Analytics
  • Predictive Analytics, and 
  • Prescriptive Analytics


Descriptive Analytics: The most basic form of analytics is descriptive analytics. The simplest way to define descriptive analytics is that it answers the question "What has happened?" Descriptive analytics typically condenses large amounts of historical or real-time data into smaller, more meaningful nuggets of information.

For example, in an internet marketing context, descriptive analytics could be used to summarize a large number of search, display and social media advertising campaigns into a smaller set of metrics that shows the average click-through rate, conversion rate, and the return on investment of each of these three advertising channels.
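A minimal sketch of this kind of summarization, assuming a hypothetical campaigns table with per-campaign impressions, clicks, conversions, spend, and revenue:

import pandas as pd

campaigns = pd.DataFrame({
    "channel":     ["search", "search", "display", "display", "social"],
    "impressions": [10000, 8000, 50000, 40000, 20000],
    "clicks":      [500, 320, 600, 450, 700],
    "conversions": [50, 30, 24, 18, 35],
    "spend":       [1000.0, 800.0, 900.0, 700.0, 600.0],
    "revenue":     [2500.0, 1500.0, 1200.0, 900.0, 1400.0],
})

# Descriptive analytics: condense many campaigns into a few per-channel metrics.
by_channel = campaigns.groupby("channel").sum()
summary = pd.DataFrame({
    "click_through_rate":   by_channel["clicks"] / by_channel["impressions"],
    "conversion_rate":      by_channel["conversions"] / by_channel["clicks"],
    "return_on_investment": (by_channel["revenue"] - by_channel["spend"]) / by_channel["spend"],
})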



The main objective of descriptive analytics is to find out the reasons behind success or failure in the past. The vast majority of big data analytics used by organizations falls into the category of descriptive analytics.


Predictive Analytics: The next class of analytics, predictive analytics, uses data from the past to predict what will happen in the future.

For example, suppose we would like to predict the likelihood that a new prospective customer will respond to a promotional email campaign. By analyzing past data on prospects who did and did not respond to similar campaigns, analytics can help identify what distinguishes those who responded from those who did not. On the basis of this data, a model can be built to assess the probability that a new prospect will respond to a future campaign.
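A sketch of what such a model might look like using scikit-learn's logistic regression; the features (past_purchases, days_since_signup), the responded label, and all of the numbers are invented for illustration:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Historical prospects: features plus whether they responded to a past campaign.
history = pd.DataFrame({
    "past_purchases":    [0, 3, 1, 5, 0, 2, 4, 0],
    "days_since_signup": [400, 30, 200, 15, 365, 90, 20, 500],
    "responded":         [0, 1, 0, 1, 0, 1, 1, 0],
})

model = LogisticRegression()
model.fit(history[["past_purchases", "days_since_signup"]], history["responded"])

# Estimated probability that a new prospect will respond to a future campaign.
new_prospect = pd.DataFrame({"past_purchases": [2], "days_since_signup": [60]})
response_probability = model.predict_proba(new_prospect)[0, 1]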


As we can see, predictive analytics is built on a solid understanding of what happened in the past, so most organizations deploy it after they have mastered the art and science of descriptive analytics.


Prescriptive Analytics: Armed with models of the past and forecasts of the future, organizations can then venture to the more advanced level of analytics, prescriptive analytics.


This class of methods uses optimization algorithms to determine the set of actions that optimizes a desirable objective, given predictions of what is likely to happen.

Referring to the previous example, and assuming that we have built predictive models for a number of different campaigns, a prescriptive analytics model could be used to determine which campaign should be sent to which prospects in order to maximize our expected sales while staying within our marketing budget.
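A deliberately simplified sketch of the idea: given predicted expected sales for each (prospect, campaign) pair and a fixed budget, a greedy heuristic (standing in for a real optimization algorithm) chooses which campaign to send to whom. All names and numbers are hypothetical:

# Predicted expected sales and contact costs for each (prospect, campaign) pair.
options = [
    {"prospect": "p1", "campaign": "spring_promo",  "expected_sales": 120.0, "cost": 10.0},
    {"prospect": "p1", "campaign": "loyalty_offer", "expected_sales": 90.0,  "cost": 5.0},
    {"prospect": "p2", "campaign": "spring_promo",  "expected_sales": 60.0,  "cost": 10.0},
    {"prospect": "p3", "campaign": "loyalty_offer", "expected_sales": 40.0,  "cost": 5.0},
]
budget = 15.0

# Greedy prescriptive step: at most one campaign per prospect, preferring the
# best expected sales per unit cost, without exceeding the budget.
chosen, spent, assigned = [], 0.0, set()
for opt in sorted(options, key=lambda o: o["expected_sales"] / o["cost"], reverse=True):
    if opt["prospect"] not in assigned and spent + opt["cost"] <= budget:
        chosen.append(opt)
        spent += opt["cost"]
        assigned.add(opt["prospect"])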


As organizations gain experience and skills with data-driven business decision-making, they typically progress from descriptive, to predictive, and finally to using prescriptive analytics to inform decisions and actions in a growing number of functions.