Sunday, December 20, 2020

Sources of Data Available Inside and Outside the Organization, and Data Source Terminology

Internal and External Data Sources Available to a Company

Potential sources of data vary from one scenario to the next, and it is easy to overlook an opportunity. One way to reduce that risk is to keep a comprehensive taxonomy of common data sources in mind.

Internal Data Sources Available to an Organization

Internal data sets and data sources are those that can be derived entirely from data or activities that exist within the organization. The categories of potential data sources within an organization are broken down here for your review and reference.

Existing data sets already being generated and/or stored in digital form can include the following.

  • Structured data (e.g., personnel or accounting records, sales and transactions)
  • Semi-structured or unstructured data (e.g., a data warehouse, or social media posts made by the organization)
  • Metadata of existing data sets

Definition: The term metadata typically refers to information about a data set or individual entries in that data set. Most data sets have at least some metadata associated with them. Examples include the date and time of creation, how the data is structured or organized, or permissions that determine who can access or modify it.
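As a small illustration of the definition above, the following sketch reads a few pieces of metadata that the file system keeps about a stored data set. The file name sales.csv, its contents, and the helper describe_file are made up for the example. Note that it reports the last-modified time rather than the creation time, because creation time is not available in a portable way.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Return a small dictionary of file-system metadata for `path`."""
    info = os.stat(path)
    return {
        "size_bytes": info.st_size,
        # Last modification time, reported in UTC.
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
        # Permission bits only (e.g. '0o644'); the value is platform dependent.
        "permissions": oct(info.st_mode & 0o777),
    }


# Hypothetical data set, written to a temporary folder for the demonstration.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sales.csv")
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write("order_id,amount\n1,19.99\n")
    metadata = describe_file(path)

print(metadata)
```

The data itself (the two CSV lines) never appears in the result; only information about the data set does, which is what distinguishes metadata from data.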

Assets and business activities within the organization that can potentially be surveyed, measured, and/or tracked to generate new data sets include those enumerated below.

  • Tracking information and measurements
    • Existing assets (e.g., current inventory of manufactured goods)
    • Internal events (e.g., sales figures for products)
    • Interactions with other organizations (e.g., subcontractors or partner organizations)
    • External opportunities
  • Exploratory or diagnostic experiments conducted within the organization
  • Crowd-sourced data

External Data Sources Available to an Organization

Breakdowns of the different categories of potential external data sources are reproduced here for your review and reference.

  • Acquired or purchased data sets
    • Data sets provided in structured form by commercial organizations (e.g., Nielsen Holdings) that may be relevant to the business question
    • Data streams of structured information to which it may be possible to subscribe for a fee (e.g., website traffic analysis services such as Google Analytics)
  • Data provided by customers
    • Social networking and social media services often provide APIs that can be used to collect information posted by customers (both on their own accounts and on the organization’s accounts)
    • Direct communications from customers, including email, can be a rich data source
  • Free, open, or publicly accessible data sources
    • Some private organizations provide data sets via online portals (though it is important to check any restrictions on the use of that data in the license that accompanies it)
    • Many data sets are provided by governments, government agencies (such as the US Census Bureau) and non-profit organizations via open data portals
  • Other data published or publicly accessible (e.g., online) in unstructured form, as long as its use does not violate the applicable terms and licenses
    • Data that can be manually collected and curated into a structured form
    • Data published online that can be automatically parsed and collected via web scraping (usually a workflow that a data engineer is best suited to implement)
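The last item in the list, turning published but unstructured content into structured form, can be sketched with Python's standard html.parser module. The page below is an inline string with made-up values, so nothing is downloaded; a real scraper would also have to fetch the page and respect the site's terms and license. The class name TableCellParser is an assumption of this sketch.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Text outside a cell (such as indentation) is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up page content standing in for a published web page.
PAGE = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>30000</td></tr>
  <tr><td>Shelbyville</td><td>25000</td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(PAGE)

# The first row is the header; the remaining rows become records.
header, *records = parser.rows
structured = [dict(zip(header, record)) for record in records]
print(structured)
```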

Once we have chosen the assets and activities of interest within the organization that can act as data sources, we need to identify the means available to the organization to collect and possibly store the desired data. The question of what resources are required to collect and store the new or existing data is driven in part by the characteristics of the business question being addressed. For example, is a one-time decision being made, or is a new and ongoing process being introduced within the organization? Will a new unit within the organization be responsible for acting on these data sources? We introduce several characterizations of data sources that can help navigate these issues.


Definition: A static or one-time data source or data set consists of a fixed quantity of data that can be retrieved once. Such a data set may have been collected via a one-time survey or study. It may also have been obtained via the commissioning of an outside consulting firm or through a one-time purchase from a vendor.

Definition: A transactional data source or data set typically consists of entries that describe events (i.e., changes that occur as a result of a transaction) that have a specified time and may refer to one or more reference objects. These events are normally described by verbs (e.g., ordered, paid, delivered).

Typical categories of transactional data include financial (e.g., invoices and payments), operational (e.g., tasks assigned or completed), and logistical (e.g., orders and deliveries).
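One possible way to picture a transactional entry is as a record with a verb, a timestamp, and references to other objects. The sketch below is only an illustration: the Transaction class, the identifiers, and the amounts are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Transaction:
    event: str             # the verb describing what happened
    occurred_at: datetime  # the specified time of the event
    references: dict       # reference objects, keyed by their kind
    amount: float = 0.0


# Hypothetical financial and logistical events.
log = [
    Transaction("paid", datetime(2020, 12, 3, 14, 30),
                {"customer": "C-17", "invoice": "INV-100"}, 250.0),
    Transaction("invoiced", datetime(2020, 12, 1, 9, 0),
                {"customer": "C-17", "invoice": "INV-100"}, 250.0),
    Transaction("delivered", datetime(2020, 12, 4, 11, 15),
                {"customer": "C-17", "order": "ORD-55"}),
]


def events_for(entries, kind, identifier):
    """Return the events that refer to one reference object, oldest first."""
    matching = [t for t in entries if t.references.get(kind) == identifier]
    return sorted(matching, key=lambda t: t.occurred_at)


for entry in events_for(log, "invoice", "INV-100"):
    print(entry.occurred_at.isoformat(), entry.event, entry.amount)
```

Because every entry carries its own time and references, the history of any single object (here, one invoice) can be reconstructed from the log.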

Definition: A real-time or streaming source of data (also known as a data feed) is one from which data is being generated continuously at some (possibly high) rate.

In some cases, the streaming data may be delivered directly to the organization, in which case the organization must maintain an automated infrastructure that can determine where to store this data by provisioning internal (or cloud-based) storage resources as appropriate. In other cases, an organization may have the option to sample a data stream as necessary.
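Sampling a stream, as mentioned above, can be done in several ways. One common technique (not one prescribed by this article) is reservoir sampling, which keeps a fixed-size uniform random sample without storing the whole stream. The sensor_feed generator is a hypothetical stand-in for a real data feed.

```python
import random


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a stored item with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = item
    return sample


def sensor_feed(n):
    """Hypothetical feed that yields n readings one at a time."""
    for i in range(n):
        yield {"reading_id": i, "value": i * 0.5}


sample = reservoir_sample(sensor_feed(10_000), k=5, rng=random.Random(0))
print(sample)
```

Only k items are held in memory at any time, so the storage cost stays fixed no matter how fast or how long the feed runs.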

Definition: A data warehouse is a system used for retrieving, integrating, and storing data from multiple sources so that it may be used for reporting and analysis.

A data warehouse normally integrates data from a number of sources (including static, transactional, and streaming data), and some amount of quality control or cleansing may be performed on this data before it is used.
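At a very small scale, the integrate, cleanse, and report cycle can be sketched with an in-memory SQLite database from Python's standard library. A production data warehouse is a far larger system; the table names, the sample rows, and the two cleansing rules (trimming whitespace and dropping entries with no amount) are assumptions made for this illustration.

```python
import sqlite3

# Two hypothetical sources: a static customer list and transactional sales.
static_customers = [("C-1", " Alice "), ("C-2", "Bob")]
transactions = [
    ("C-1", "2020-12-01", 20.0),
    ("C-2", "2020-12-02", None),  # incomplete entry with no amount
    ("C-1", "2020-12-05", 5.5),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE sales (customer_id TEXT, sold_on TEXT, amount REAL)")

# Cleansing step 1: trim stray whitespace from names before loading.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(cid, name.strip()) for cid, name in static_customers],
)
# Cleansing step 2: drop sales entries that have no amount.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [row for row in transactions if row[2] is not None],
)

# Reporting: total sales per customer across the integrated sources.
report = conn.execute(
    "SELECT c.name, SUM(s.amount) "
    "FROM customers c JOIN sales s ON s.customer_id = c.customer_id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
conn.close()

print(report)
```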

Definition: The provenance of a data set or stream (or an item therein) is a record of its origin and, possibly, its lifespan. This includes to whom the data set can be attributed, at what location and time it was created, from what other data sets it was derived, and by what process this derivation was accomplished.

Data provenance can be tracked at a coarse granularity (i.e., for entire data sets) or at a fine granularity (i.e., for every individual entry within the data set). The provenance information associated with a data set, or an entry within a data set, could constitute a part of its metadata.

The World Wide Web Consortium’s PROV standard lays out a well-defined format for provenance documents. It has been implemented in many machine-readable digital representations, and many software tools exist that allow users to edit, combine, and visualize provenance documents. Each document can be a record of the provenance of one or more data sets, and is broken down into entities (e.g., data sets, individual data entries, reports, and so on), agents (e.g., data analysts, or automated algorithms making decisions according to a schedule and/or some set of rules), and activities (e.g., events that generate data entries or data sets, data analyses executed using certain tools, data-driven operational decisions made by the organization, and so on).
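To make the three building blocks concrete, here is a minimal plain-Python sketch of a provenance record. It is not one of the official PROV serializations and does not use any PROV library; the class ProvenanceDocument and all identifiers are hypothetical. The relation names (wasDerivedFrom, wasGeneratedBy, used, wasAssociatedWith, wasAttributedTo) are borrowed from the PROV vocabulary, which calls the acting parties agents.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceDocument:
    entities: dict = field(default_factory=dict)    # data sets, reports, ...
    agents: dict = field(default_factory=dict)      # people or algorithms
    activities: dict = field(default_factory=dict)  # analyses, decisions, ...
    relations: list = field(default_factory=list)   # (subject, relation, object)

    def relate(self, subject, relation, obj):
        self.relations.append((subject, relation, obj))

    def sources_of(self, entity_id):
        """Follow wasDerivedFrom links back to every ancestor entity."""
        found = []
        frontier = [entity_id]
        while frontier:
            current = frontier.pop()
            for subject, relation, obj in self.relations:
                if (subject == current and relation == "wasDerivedFrom"
                        and obj not in found):
                    found.append(obj)
                    frontier.append(obj)
        return found


doc = ProvenanceDocument()
doc.entities.update({
    "raw_sales": "exported sales records",
    "clean_sales": "deduplicated sales records",
    "q4_report": "quarterly report",
})
doc.agents["analyst"] = "data analyst"
doc.activities["cleaning"] = "deduplication run"

doc.relate("clean_sales", "wasDerivedFrom", "raw_sales")
doc.relate("q4_report", "wasDerivedFrom", "clean_sales")
doc.relate("clean_sales", "wasGeneratedBy", "cleaning")
doc.relate("cleaning", "used", "raw_sales")
doc.relate("cleaning", "wasAssociatedWith", "analyst")
doc.relate("q4_report", "wasAttributedTo", "analyst")

print(doc.sources_of("q4_report"))
```

This example tracks provenance at a coarse granularity (whole data sets); a fine-grained version would register individual entries as entities instead.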
