
The Data Refinery

Refinery Photo

Vermin Inc / Free Photos

Analysts often describe the process of turning corporate data into information as an information factory. The factory analogy is a good one in an environment where the raw materials for the data warehouse are sourced from applications that have already refined and packaged the data into transactions or tables. Much of the processing of the data is “assembly” on a production line: joining data so that the whole is more useful than the sum of its parts.

The introduction of large volumes of event data on the front end, as we see in today’s “big data”, makes me more inclined to think of it as a data refinery. This data is more like crude oil, which has to be transformed into other substances like gasoline and kerosene before it is useful.

This thinking was reinforced by a post entitled “The Data Stack: A Structured Approach” by Gil Elbaz at Factual.com where he describes his perspective on the technology layers required to make data useful.  He’s building the mother of all data sources (though that description doesn’t really do his business justice).

For the enterprise, I think the stack today looks something like the following:

Raw Material – Transaction data from enterprise applications and event data from other systems

Sourcing – Determining where to get the data, when, and at what point in the lifecycle of the data at its source. Do you want “raw” event data from the source, a transaction log that sequences and formats the data, or will you rely on an application or service provider who manipulates the data before you get it?

Connection – The method and timing for acquiring the data. Will you rely on file transfers or web application programming interfaces? Is the data needed in “real time” or on a periodic basis? What is the extraction method from the data source?

Quality/Provenance – When and how to determine the quality of the data, and how to retain source information about the data so that subsequent analysts will have some measure of its reliability.

Collection – Where to put the data. This is where most of our thinking tends to go: what database will we use? It is also where there can be a real divergence between what we’ve traditionally used as an operational data store for event data and the data warehouse for transaction data.

Combination – The information common to different data sets that will be used to join them. Determining or deriving keys to make this possible can be really hard for event data that doesn’t come from traditional enterprise applications. Event data, for example, may not contain explicit identifiers like <customer number> that can be used to join the data to your CRM or Marketing databases.

Derivation – How new data can be derived from existing data.   For example, a point of sale event gets a new segmentation code appended to it based on the contents of the entire market basket purchased by the customer.

Access Methods – The methods that will be used to get access to the data. Will it be made available to other programs and systems, or does it need to be made available in a human-readable format?

Presentation – The reporting and analytic methods that are used to understand the data.

Stewardship – The method for identifying who is responsible for the quality and availability of the data all along the refining process.   Who do you call when you want to know what the data in a field represents and how it got there?

Security – The protection of data as a corporate asset, and the protection of personal information by law.
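To make two of these layers a little more concrete, here is a minimal Python sketch of Combination and Derivation. Everything in it is an assumption made up for illustration: the sample records, the field names (`customer_number`, `basket_id`, `segment`), the choice of a normalized email address as the derived join key, and the segmentation rule itself. It is one possible shape for these steps, not a recommended design.

```python
from collections import defaultdict

# Hypothetical CRM records that carry an explicit customer number.
CRM = [
    {"customer_number": "C001", "email": "ann@example.com"},
    {"customer_number": "C002", "email": "bob@example.com"},
]

# Hypothetical point-of-sale events: no customer number, only a receipt email.
EVENTS = [
    {"basket_id": "B1", "email": " Ann@Example.com ", "item": "diapers"},
    {"basket_id": "B1", "email": " Ann@Example.com ", "item": "wipes"},
    {"basket_id": "B2", "email": "bob@example.com", "item": "coffee"},
    {"basket_id": "B3", "email": "unknown@example.com", "item": "coffee"},
]


def normalize_email(raw):
    """Derive a join key by trimming whitespace and lowercasing."""
    return raw.strip().lower()


def combine(events, crm):
    """Combination: attach a customer number to each event via the derived key.

    Events with no CRM match get customer_number=None rather than being dropped.
    """
    by_email = {normalize_email(c["email"]): c["customer_number"] for c in crm}
    return [
        {**e, "customer_number": by_email.get(normalize_email(e["email"]))}
        for e in events
    ]


def derive_segments(events):
    """Derivation: append a segment code based on the whole market basket."""
    baskets = defaultdict(set)
    for e in events:
        baskets[e["basket_id"]].add(e["item"])

    def segment(items):
        # Made-up rule: any baby product in the basket marks it as FAMILY.
        return "FAMILY" if items & {"diapers", "wipes"} else "GENERAL"

    return [{**e, "segment": segment(baskets[e["basket_id"]])} for e in events]


enriched = derive_segments(combine(EVENTS, CRM))
for row in enriched:
    print(row["basket_id"], row["item"], row["customer_number"], row["segment"])
```

Note that the unmatched event is kept with an empty customer number instead of being discarded, which leaves the question of what to do with it to the Quality/Provenance and Stewardship layers.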

I’m sure I’ve left some things out, but I will have to think about it some more.

  1. August 8, 2012 at 11:25 am

    I like the “refinery” idea… especially if the resulting “products” make it to a “factory” where “consumer” products are manufactured.

    Maybe your derivation layer is the factory? Certainly, there is lots going on there as analysts, statisticians, and data scientists all labor to craft value from the raw materials?
