Showing posts with label Business Intelligence. Show all posts

Sunday, December 20, 2020

Common Data Transformations

Forms of common data transformation

The basic data transformation types are enumerated and discussed in detail below:

Union transformations take two data sets as their input and produce an output data set that contains all entries found in either data set. The output data set must have at least as many records as the larger of the two input data sets.

Intersection transformations take two data sets as their input and produce an output data set that contains only those entries found in both input data sets. The output data set must have at most as many records as the smaller of the two input data sets.

Difference transformations take two data sets as their input and produce a data set that contains only those records found in the first data set but not the second data set. The output data set must have at most as many records as the number of records in the first input data set.
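As a rough illustration of the three transformations above, here is a minimal sketch using Python's built-in set type. The customer ID values are made up for the example, and a real data set would usually hold full records rather than bare IDs.

```python
# Two small data sets of customer IDs (hypothetical values).
first = {101, 102, 103, 104}
second = {103, 104, 105}

union = first | second          # entries found in either data set
intersection = first & second   # entries found in both data sets
difference = first - second     # entries in the first but not the second

# Check the size bounds for each kind of output.
assert len(union) >= max(len(first), len(second))
assert len(intersection) <= min(len(first), len(second))
assert len(difference) <= len(first)

print(sorted(union))         # [101, 102, 103, 104, 105]
print(sorted(intersection))  # [103, 104]
print(sorted(difference))    # [101, 102]
```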

Selection transformations involve extracting some portion of the data based on zero or more filtering conditions or criteria. A selection transformation might return the entire original data set (e.g., if the criteria are already satisfied by all the records in the input data set), but it cannot return a result that is larger than the original data set.

A filtering condition within a selection transformation usually consists of a logical expression that is either true or false for each record. The condition can reference the values found in each record using their corresponding attribute/column names; it can also contain arithmetic operators (addition, subtraction, multiplication, division, and so on), relational operators (equality, comparison, and so on), and logical operations (such as "and" and "or"). In some database management systems, more complex conditions can be defined (e.g., ones that do text search).
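A selection with a filtering condition might be sketched as follows. The records and attribute names (quantity, unit_price, region) are assumptions made for the example; the condition combines an arithmetic, a relational, and a logical operator.

```python
# Hypothetical purchase records; the attribute names are assumptions.
purchases = [
    {"customer": "Ann", "quantity": 3, "unit_price": 20.0, "region": "East"},
    {"customer": "Bob", "quantity": 1, "unit_price": 15.0, "region": "West"},
    {"customer": "Cho", "quantity": 5, "unit_price": 8.0, "region": "East"},
]


def condition(record):
    """Filtering condition: evaluates to True or False for each record."""
    total = record["quantity"] * record["unit_price"]      # arithmetic
    return total >= 40.0 and record["region"] == "East"    # relational + logical


# Selection keeps only the records for which the condition is true.
selected = [record for record in purchases if condition(record)]

# A selection can never return more records than it was given.
assert len(selected) <= len(purchases)

print([record["customer"] for record in selected])  # ['Ann', 'Cho']
```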

Projection transformations involve converting every record in a data set in some way to produce a new data set. A projection transformation always produces the same number of records in the output data set as there were in the input data set. The conversion itself might throw away some attributes/columns or might introduce new ones. The definition of a projection transformation can use arithmetic and other operations to transform the values inside the input data set's records into the values within the records of the output data set.

Renaming transformations simply rename one or more of the attributes/columns. They are usually combined with projection and selection transformations so that the output data sets can have informative attribute/column names.
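Projection and renaming are often done in one step. The sketch below, with made-up attribute names, drops one attribute, renames another, and introduces a computed one, while keeping the number of records unchanged.

```python
# Hypothetical input records; the attribute names are assumptions.
records = [
    {"qty": 3, "unit_price": 20.0, "note": "gift"},
    {"qty": 1, "unit_price": 15.0, "note": ""},
]


def project(record):
    """Convert one input record into one output record."""
    return {
        "quantity": record["qty"],                      # renamed attribute
        "total": record["qty"] * record["unit_price"],  # new computed attribute
        # "unit_price" and "note" are dropped from the output
    }


projected = [project(record) for record in records]

# Every input record yields exactly one output record.
assert len(projected) == len(records)

print(projected[0])  # {'quantity': 3, 'total': 60.0}
```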

The advanced data transformation types are enumerated and discussed in detail below for your reference and review.

Aggregation transformations involve combining the values within a particular attribute/column across all records in a data set. Examples of tasks for which this may be useful include counting the number of records in a data set, taking the sum of all values in a column, finding the maximum value across all values in a column, and so on. In its basic form, an aggregation transformation produces a data set with exactly one record.

In some languages and database management systems, it is possible to group the records using a particular attribute (which we call the grouping attribute) when performing an aggregation. In this case, the aggregation operation is applied separately to each collection of records that share the same value of the grouping attribute, and the number of records in the output data set corresponds to the number of unique values found in the grouping attribute/column.
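Both forms of aggregation can be sketched in a few lines of plain Python. The sales records and the choice of "region" as the grouping attribute are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sales records; "region" will serve as the grouping attribute.
sales = [
    {"region": "East", "amount": 60.0},
    {"region": "West", "amount": 15.0},
    {"region": "East", "amount": 40.0},
]

# Basic aggregation: the whole data set collapses into exactly one record.
summary = {
    "count": len(sales),
    "total": sum(record["amount"] for record in sales),
    "largest": max(record["amount"] for record in sales),
}

# Grouped aggregation: one output record per unique grouping value.
totals_by_region = defaultdict(float)
for record in sales:
    totals_by_region[record["region"]] += record["amount"]

grouped = [
    {"region": region, "total": total}
    for region, total in sorted(totals_by_region.items())
]

assert len(grouped) == len({record["region"] for record in sales})

print(summary)  # {'count': 3, 'total': 115.0, 'largest': 60.0}
print(grouped)  # [{'region': 'East', 'total': 100.0}, {'region': 'West', 'total': 15.0}]
```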

Join transformations take two input data sets and return their Cartesian product. Thus, the number of entries in the output data set may be larger (even significantly larger) than the number of entries in each of the two input data sets. It is common to combine join transformations with selection transformations in order to pair corresponding records using their identifiers (or other attributes) even if the records are found in different data sets. One example of this might be matching all purchase records in a purchases data set with all customer records in a customers data set.
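The purchases-and-customers example can be sketched as a Cartesian product followed by a selection on matching identifiers. The records and attribute names below are hypothetical. Database systems normally avoid materializing the full product, so this shows only the logical definition.

```python
from itertools import product

# Hypothetical data sets; the attribute names are assumptions.
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]
purchases = [
    {"purchase_id": 10, "customer_id": 1, "amount": 60.0},
    {"purchase_id": 11, "customer_id": 2, "amount": 15.0},
    {"purchase_id": 12, "customer_id": 1, "amount": 40.0},
]

# Join: every purchase paired with every customer (3 x 2 = 6 pairs).
cartesian = list(product(purchases, customers))
assert len(cartesian) == len(purchases) * len(customers)

# Selection on top of the join: keep only pairs whose identifiers match.
matched = [
    {"purchase_id": p["purchase_id"], "name": c["name"], "amount": p["amount"]}
    for p, c in cartesian
    if p["customer_id"] == c["customer_id"]
]

print([(m["purchase_id"], m["name"]) for m in matched])
# [(10, 'Ann'), (11, 'Bob'), (12, 'Ann')]
```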

 

Sources of data available both inside and outside of the organization and Data source terminologies

Internal and External data sources available in a company

Potential sources of data can vary across scenarios, and it would be easy to miss an opportunity. One way this can be alleviated is by keeping in mind a comprehensive taxonomy of common data sources.

Internal Data Sources Available to an Organization

 Internal data sets and data sources are those that can be derived in whole from the existing data or activities that exist entirely within the organization. Breakdowns of the different categories of potential data sources within an organization are reproduced here for your review and reference.

Existing data sets already being generated and/or stored in digital form can include the following.

  • Structured data (e.g., personnel or accounting records, sales and transactions)
  • Semi-structured or unstructured data (e.g., a data warehouse, or social media posts made by the organization)
  • Metadata of existing data sets

Definition: The term metadata typically refers to information about a data set or individual entries in that data set. Most data sets have at least some metadata associated with them. Examples include the date and time of creation, how the data is structured or organized, or permissions that determine who can access or modify it.

Assets and business activities within the organization that can potentially be surveyed, measured, and/or tracked to generate new data sets include those enumerated below.

  • Tracking information and measurements
    • Existing assets (e.g., current inventory of manufactured goods)
    • Internal events (e.g., sales figures for products)
    • Interactions with other organizations (e.g., subcontractors or partner organizations)
    • External opportunities
  • Exploratory or diagnostic experiments conducted within the organization
  • Crowd-sourced data

External Data Sources Available to an Organization

Breakdowns of the different categories of potential external data sources are reproduced here for your review and reference.
  • Acquired or purchased data sets
    • Data sets provided in structured form by commercial organizations (e.g., Nielsen Holdings) that may be relevant to the business question
    • Data streams of structured information to which it may be possible to subscribe for a fee (e.g., website traffic analysis services such as Google Analytics)
  • Data provided by customers
    • Social networking and social media services often provide APIs that can be used to collect information posted by customers (both on their own accounts and on the organization’s accounts)
    • Direct communications from customers, including email, can be a rich data source
  • Free, open, or publicly accessible data sources
    • Some private organizations provide data sets via online portals (though it is important to check any restrictions on the use of that data in the license that accompanies it)
    • Many data sets are provided by governments, government agencies (such as the US Census Bureau) and non-profit organizations via open data portals
  • Other data published or publicly accessible (e.g., online) in unstructured form, as long as its use does not violate the applicable terms and licenses
    • Data that can be manually collected and curated into a structured form
    • Data published online that can be automatically parsed and collected via web scraping (usually a workflow that a data engineer is best-suited to implement)

 

Once we have chosen the assets and activities of interest within the organization that can act as data sources, we need to identify the means available to the organization to collect and possibly store the desired data. The question of what resources are required to collect and store the new or existing data is driven in part by the characteristics of the business question being addressed. For example, is a one-time decision being made, or is a new and ongoing process being introduced within the organization? Will a new unit within the organization be responsible for acting on these data sources? We introduce several characterizations of data sources that can help navigate these issues.


Definition: A static or one-time data source or data set consists of a fixed quantity of data that can be retrieved once. Such a data set may have been collected via a one-time survey or study. It may also have been obtained via the commissioning of an outside consulting firm or through a one-time purchase from a vendor.

Definition: A transactional data source or data set typically consists of entries that describe events (i.e., changes that occur as a result of a transaction) that have a specified time and may refer to one or more reference objects. These events normally correspond to verbs.

Typical categories of transactional data include financial (e.g., invoices and payments), operational (e.g., tasks assigned or completed), and logistical (e.g., orders and deliveries).

Definition: A real-time or streaming source of data (also known as a data feed) is one from which data is being generated continuously at some (possibly high) rate.

In some cases, the streaming data may be delivered directly to the organization, in which case the organization must maintain an automated infrastructure that can determine where to store this data by provisioning internal (or cloud-based) storage resources as appropriate. In other cases, an organization may have the option to sample a data stream as necessary.
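One possible way to sample a stream whose length is not known in advance is reservoir sampling, which keeps a fixed-size uniform random sample while reading each item only once. This is just one technique among several, and the text above does not prescribe it. The sensor_feed generator is a hypothetical stand-in for a real data feed.

```python
import random
from typing import Iterable, Iterator, List


def reservoir_sample(stream: Iterable[int], k: int, seed: int = 0) -> List[int]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[int] = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace an existing entry with probability k / (index + 1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                sample[slot] = item
    return sample


def sensor_feed(n: int) -> Iterator[int]:
    """Hypothetical stand-in for a real-time feed that yields n readings."""
    for reading in range(n):
        yield reading


sample = reservoir_sample(sensor_feed(10_000), k=5)
print(len(sample))  # 5
```

Only the k sampled items are held in memory at any time, so storage needs stay fixed no matter how fast or how long the feed runs.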

Definition: A data warehouse is a system used for retrieving, integrating, and storing data from multiple sources so that it may be used for reporting and analysis.

A data warehouse normally integrates data from a number of sources (including static, transactional, and streaming data), and some amount of quality control or cleansing may be performed on this data before it is used.

Definition: The provenance of a data set or stream (or an item therein) is a record of its origin and, possibly, its lifespan. This includes to whom the data set can be attributed, at what location and time it was created, from what other data sets it was derived, and by what process this derivation was accomplished.

Data provenance can be tracked at a coarse granularity (i.e., for entire data sets) or at a fine granularity (i.e., for every individual entry within the data set). The provenance information associated with a data set, or an entry within a data set, could constitute a part of its metadata.

The World Wide Web Consortium’s PROV standard lays out a well-defined format for provenance documents. It has been implemented in many machine-readable digital representations, and many software tools exist that allow users to edit, combine, and visualize provenance documents. Each document can be a record of the provenance of one or more data sets, and is broken down into entities (e.g., data sets, individual data entries, reports, and so on), agents (e.g., data analysts, or automated algorithms making decisions according to a schedule and/or some set of rules), and activities (e.g., events that generate data entries or data sets, data analyses executed using certain tools, data-driven operational decisions made by the organization, and so on).
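To make the three-part breakdown concrete, here is a plain-Python sketch of a tiny provenance record. It is only loosely modeled on the PROV concepts (entities, activities, and the people or software responsible, which PROV calls agents). It is not a valid PROV serialization, and every class, field name, and identifier is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    id: str
    description: str


@dataclass
class Agent:
    id: str
    description: str


@dataclass
class Activity:
    id: str
    description: str
    used: List[str] = field(default_factory=list)             # input entity ids
    generated: List[str] = field(default_factory=list)        # output entity ids
    associated_with: List[str] = field(default_factory=list)  # agent ids


def derived_from(entity_id: str, activities: List[Activity]) -> List[str]:
    """Return ids of entities used by any activity that generated entity_id."""
    sources: List[str] = []
    for activity in activities:
        if entity_id in activity.generated:
            sources.extend(activity.used)
    return sources


# Hypothetical example: an analyst builds a quarterly report from raw sales.
entities = [Entity("raw_sales", "raw sales data set"),
            Entity("q4_report", "quarterly sales report")]
agents = [Agent("analyst_1", "data analyst")]
activities = [Activity("build_report", "aggregate sales by quarter",
                       used=["raw_sales"], generated=["q4_report"],
                       associated_with=["analyst_1"])]

print(derived_from("q4_report", activities))  # ['raw_sales']
```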

 

 

Saturday, December 19, 2020

Descriptive, Predictive, and Prescriptive analytics classification questions


 Please read about Descriptive, Predictive, and Prescriptive analytics here: https://triksbuddy.blogspot.com/2020/12/what-are-different-types-of-analytics.html

Then try to answer the following questions and check your choices against the answer key below:

1. A classification of our customers into four quantiles according to their profitability in the last four quarters. 

A. Descriptive
B. Predictive
C. Prescriptive

 

2. A classification of our customers into four quantiles according to their expected profitability in the following four quarters.

A. Descriptive
B. Predictive
C. Prescriptive

 

3. A model that assigns a credit limit to each customer such that it optimizes our bank’s expected profits in the next four quarters.

A. Descriptive
B. Predictive
C. Prescriptive

 

4. A list of our best 10 customers on the basis of their sales growth in the last quarter.

A. Descriptive
B. Predictive
C. Prescriptive

 

5. A list of the 10 customers that are most likely to leave our company in the next two quarters.

A. Descriptive
B. Predictive
C. Prescriptive

 

6. A model that assigns to each credit card transaction a score that represents its probability of being a fraudulent transaction.

A. Descriptive
B. Predictive
C. Prescriptive

 

7. A model that outputs a preventive maintenance schedule of airplane engines such that it minimizes our airline’s annual maintenance and downtime expenditure.

A. Descriptive
B. Predictive
C. Prescriptive

 

8. A model that schedules the timing of the posting of an individual’s tweets so as to maximize the daily number of associated retweets.

A. Descriptive
B. Predictive
C. Prescriptive

 

9. A list of students who are at high risk of dropping out of our university in the next two semesters.

A. Descriptive
B. Predictive
C. Prescriptive

 

10. A model that suggests an individualized student degree completion path such that it minimizes the likelihood that the student will quit her studies before completing her degree.

A. Descriptive
B. Predictive
C. Prescriptive

 

 

 Answer: 

1. A 

2. B  

3. C  

4. A

5. B

6. B

7. C

8. C

9. B

10. C

 

Monday, July 1, 2019

What are the uses of Data Analytics?

Usage of Data Analytics


Broadly, data analytics can be used for:

1. Description: Provide an overview and summary of the existing state of the world. For example: What is the average age of our customers? How much do they spend, on average, each time they buy? What is the distribution of amounts spent?

2. Comparison: Is group A different in some meaningful way from group B, and if so, in what way and by how much? Examples: Do men spend more than women? Does one advertisement work better than others?

3. Clustering / Grouping / Co-occurrence: Group together things that are “similar” according to some definition of “similar”. Example: Are there groups of customers with similar buying/purchase habits? If you know some marketing, cluster analysis is what is used to divide customers into “segments”.

4. Classification: Assign a probability that something belongs to one of several mutually exclusive classes. Examples: Is this credit card transaction fraudulent? (A: probability Yes/No) Will this person donate to my charity? (A: probability Yes/No) Is this person suffering from a heart attack, or some other mimic condition? (A: probability of Attack)

5. Prediction: Predict the most likely value of a continuous variable. Examples: What will sales be next quarter? How much will this group of customers spend over the next year? What will be the market share of our new product?
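The first two uses, description and comparison, need nothing more than summary statistics. The sketch below uses Python's standard statistics module on a made-up customer table. The groups, ages, and spend values are assumptions, and a real comparison would also need to check whether the difference is statistically meaningful.

```python
from statistics import mean, median

# Hypothetical customer records; all values are invented for the example.
customers = [
    {"group": "A", "age": 34, "spend": 52.0},
    {"group": "B", "age": 45, "spend": 38.0},
    {"group": "A", "age": 29, "spend": 61.0},
    {"group": "B", "age": 51, "spend": 44.0},
]

# Description: summarize the existing state of the data.
average_age = mean(c["age"] for c in customers)
median_spend = median(c["spend"] for c in customers)

# Comparison: is group A different from group B, and by how much?
spend_a = mean(c["spend"] for c in customers if c["group"] == "A")
spend_b = mean(c["spend"] for c in customers if c["group"] == "B")
difference = spend_a - spend_b

print(average_age)   # 39.75
print(median_spend)  # 48.0
print(difference)    # 15.5
```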
 

What are the applications of Data Analytics?

Applications of Data Analytics
  • Policing/Security
  • Transportation
  • Fraud and Risk Detection
  • Delivery Logistics
  • Proper Spending
  • City Planning
  • Healthcare
  • Internet/web search
  • Basket Analysis
  • Sales Forecasting
  • Inventory Planning

What is Data Analytics? Write down three ways that data analytics is impacting business today.

What is Data Analytics?
Data analytics mainly helps you make faster and better decisions based on data.

Data is a collection of facts, observations, or other information related to a particular question or problem.

Data can be structured or unstructured. Structured data is information with a high degree of organization that could be included in databases or spreadsheets and is easily searchable by simple search engine algorithms.

Unstructured data is the opposite and is usually text-heavy, though it may contain video, data or numbers and facts as well. Think of an open-field text box that allows you to provide additional comments on a survey. Adding to the complexity, data can also come to organizations from a variety of internal and external sources.


Analytics is the science of examining raw data in order to draw conclusions about the information.

It’s an exciting field, and it is dramatically changing how organizations in many industries make decisions. The availability of huge volumes of structured and unstructured data, combined with advanced computing capabilities, low-cost storage, and powerful visualization technology, is enabling organizations to gain insight from sources ranging from market research and social media to the network of physical objects we call the Internet of Things. The world we live in today is creating a constant and ever-increasing stream of data. For most organizations, the data they can access is increasing at a rate of 40% each year, which creates significant challenges in the way data is captured, secured, organized, analyzed, and reported.

 Three ways that data analytics is impacting business today:
Let’s quickly touch on three ways that data analytics is impacting business today.

First, data is enabling new products and services, creating markets that didn’t previously exist and bringing new capabilities to existing markets. Wearables, such as a Fitbit or an Apple Watch, are examples of new products.

Second, it is disrupting existing markets, with innovative upstarts unseating traditionally secure businesses. Think of Uber.

Third, data and analytics are driving increased efficiency. For example, retailers have the ability to automate and optimize their supply chains.

In short, data is giving organizations the ability to identify growth opportunities, drive innovation, operate more efficiently, and manage risk in new ways.