Data Is The Program

April 18, 2013

When I was in college I studied data structures with Dr. Mary Loomis.  We were learning how to program data structures of all kinds in memory.  At the time, most programs were written as a basic read-a-record / process-a-record / write-a-record sequence that repeated thousands of times until an end-of-file occurred.  The emphasis for us as programmers was to make those programs as efficient as possible in terms of CPU time and memory.
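
For anyone who never wrote one of those programs, here is a minimal sketch of the pattern in Python; the file names and the "processing" step are made up for illustration.

    def process(record: str) -> str:
        # Placeholder business logic; a real program would parse fixed-width
        # fields, apply rules, and format an output record.
        return record.upper()

    # The classic loop: read a record, process it, write it, repeat until EOF.
    with open("master.in") as infile, open("master.out", "w") as outfile:
        for record in infile:
            outfile.write(process(record))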

Dr. Loomis taught us the value of separating the structure and the management of data from the actual program logic as a way of creating even greater efficiency.   This approach launched wildly successful companies like Ingres, Sybase, and Oracle.

Did we lose something in the process of making data a subsystem that serves program logic?

One place to look for an answer to that question is the “analytics” space, where data scientists are building models around “big” data.  They often spend significant time acquiring and formatting data to fit the specific needs of the models they are testing.  For them, the program “logic” for their operating models is actually in the data itself.  They are typically doing one of three things:

  • Exploring the data to find out what is happening in the real world (what the program is)
  • Creating models in their heads that are then tested and refined against the “program” the data contains
  • Adapting models that have worked on other data sets to see if there is a fit.

In this case, shouldn’t the “structure” in the data be telling us what the “program” should be?  Are we cloaking the “intelligence” and “programs” in the data by forcing it into database models prescribed for the efficiencies we needed when CPUs were relatively slow and memory was very limited?

Why It Matters

The new world of in-memory databases can significantly alter the way we think about data as the source for program logic, as opposed to seeing it as only feedstock for pre-defined program logic.

One of the problems with analytics models is that they can be difficult to deploy into a production environment.  They are often defined in statistical programs that are isolated from production systems, with no formal process for implementing the results short of rewriting the production systems in some other programming environment, with the associated big budgets and long lead times.  New in-memory data structures that can support both operational and analytical needs mean that in the future it will be possible to build analytic models using the same data and infrastructure as operations.  Robust modeling languages capable of defining and executing program logic could then be placed in “production” by flipping a switch from test to live.  Your data scientists and business managers could be working in real time to monitor, model, and change your operations to adapt to economic and social changes as they occur.

What’s even more interesting is what happens if we follow the line of thinking that the program is in the data.  In other words, what if our models learn and adapt based on the state of all the data and changes in the data?

Today, it is possible for a data scientist to create a decision tree (all the if-then-else logic) that explains something we see in the real world.  For example, this can be very useful for explaining and predicting customer behavior based on historical data.  This week a major airline lost its entire traffic management system for two hours.  There was no historical basis for modeling the impact this had on other airlines, car rental companies, and lodging companies.  What if their models were instead learning models that could have adapted within hours or even minutes to the flow of new customer behaviors, creating new capacity and pricing models to leverage the “intelligence” in the data?
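
To make the "decision tree as if-then-else logic" point concrete, here is a minimal sketch using scikit-learn; the features, values, and library choice are my own illustration, not anything from the airline example.

    # Fit a tiny decision tree to made-up historical customer data and print
    # the learned if-then-else rules.
    from sklearn.tree import DecisionTreeClassifier, export_text

    X = [[5, 12], [40, 2], [3, 20], [60, 1]]   # [days_since_purchase, visits_per_month]
    y = [1, 0, 1, 0]                           # 1 = customer purchased again

    model = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(model, feature_names=["days_since_purchase", "visits_per_month"]))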

Looking for the program logic in the data, rather than imposing that logic on the data, has the potential to change the role of technology across industries, and it will be exciting to see how it unfolds.


Adapting New Data Management To Health Care

There is obviously a lot of talk about Big Data – data with relatively high Volume, Velocity, and Variety.  In health care management, the need to handle big data is acute and exacerbated by the Veracity of data – the amount of historical truth about patients and procedures that must be retained over time.

In this article, Charles Boicey explains how UC Irvine Medical Center is using Hadoop as an “adjunctive environment” to its existing enterprise data warehouse. The goal was a near real-time analytics data store instead of waiting 24 hours for the extract-transform-load processes that had to run before they could access the enterprise data warehouse.

What is interesting is the variety of data they want to access at one time – everything from nurses’ notes in the electronic medical records to lab results coming in from multiple internal and external sources (HL7). Traditional information architects would have to put a lot of thought into how to model that data for a traditional data warehouse using tables and SQL – especially to get optimal performance for load times, retrieval, sharing, merging, mastering, and query efficiency and effectiveness.
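
As a rough illustration of why a document store lowers that modeling burden, the sketch below stores a nurse's note and an HL7-style lab result side by side without agreeing on a single relational schema first. It assumes a local MongoDB instance and the pymongo driver, and every field name is made up.

    from pymongo import MongoClient

    events = MongoClient("mongodb://localhost:27017")["clinical"]["events"]

    # Two very different records land in the same collection, no table design required.
    events.insert_one({
        "type": "nursing_note",
        "patient_id": "P123",
        "text": "Patient resting comfortably, no complaints of pain.",
    })
    events.insert_one({
        "type": "lab_result",
        "patient_id": "P123",
        "source": "external HL7 feed",
        "test": "CBC",
        "values": {"WBC": 6.1, "HGB": 13.8},
    })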

I don’t think this story by itself is unique – there are lots of interesting use cases for Hadoop. What really caught my attention was a comment by one of the architects that the primary reason for the evolution of their MongoDB/Hadoop data store strategy was to avoid the need for data modeling.  I suspect it was also much easier not to deal with all the process involved in extract/transform/load logic, security, and metadata management.  Does this mean the traditional IT approach was a hindrance to the business need?  Or was any thought given to canonical models and user access security that would have benefited from the collective experience of IT data management?

I think what it says is that information management professionals have to embrace the “self service” analysis capabilities now available to business users, and work with them to get the business value they need while also helping them understand the risks of exposing these great data stores to lots of potentially less sophisticated users.  At a minimum, everyone stands to gain from a security and data governance strategy focused on how to accommodate new models for information delivery rather than stifling innovation.

Can we adapt 20 years of information management process to the new paradigm without spoiling all the cool stuff?  I think so, and especially look forward to solving lots of interesting business problems we couldn’t touch in the past.

Metamarkets, A Practical Application For In-Memory Data Management

February 21, 2013

I had the opportunity to talk with representatives of Metamarkets this week about their use of an in-memory data store, and found their argument for in-memory data storage for real-time (or near real-time) analysis compelling.

Metamarkets is a San Francisco company that provides real-time presentation and analysis of event data.  Their first customers are interactive digital marketing companies like the Financial Times that are looking for real-time feedback on the performance of advertising placements.  (They hinted that next week they will announce a very large customer using their technology to analyze on-demand video activity.)  Their solution is offered as a service running on Amazon Web Services and on-premises.  Their website posts the following stats:

  • 300+ billion events ingested and processed per month
  • 100,000+ ad-hoc, multi-dimensional queries executed per day
  • 10+ TB of compressed, memory-mapped derived data
  • 500ms average query response time

The Metamarkets data stack has interesting parallels to what SAS is doing with its Visual Analytics offering, both with essentially four layers of functionality.  Here is how it looks comparing the two side by side:

                         SAS Visual Analytics             Metamarkets
  Target Audience        Business users (self service)    Business users (self service)
  Visual Presentation    Flash                            JavaScript (proprietary scripts)
  Analytics              SAS                              R* (proprietary algorithms)
  In-Memory Data Store   SAS LASR Server                  Druid* columnar store
  Staging / ETL          Greenplum, Teradata, HDFS*       Hadoop*

*Open Source

One of the things I find most interesting is how much Hadoop (or HDFS) has become the “store and forward” layer for capturing event data for subsequent processing, both for these vendors and possibly for others pitching the equivalent of the “analytic data warehouse.”

I also think there is some debate about how “real time” the analysis is for Metamarkets, given the latency of a Hadoop ETL layer.

Metamarkets developed Druid internally and has released it as an open source project.  (They have a respectable following on GitHub, with about half as many followers as Impala from Cloudera and twice as many as VoltDB as of the date of this post.)  Time will tell if they gave away proprietary technology, or if they were smart to outsource development of what will become commodity-like technology and focus on the real intellectual property – the R language algorithms used to make sense of the data for non-data scientists.  I think the latter is most likely.

Their business model and data architecture are oriented towards time-series data, but I didn’t see anything in their architecture that would limit them to time-series data in the future.

I think it is amazing what is being accomplished today with open source software.  This is a rich time to be in the analytics business and I look forward to some of the amazing insights to come from the availability of data and modeling capabilities previously available only to well-funded data scientists.

See more about Metamarkets here at DBMS2, and Druid here at Metamarkets.com.

Winds Change

January 25, 2013

I used to fly a hang glider cross country in the 100-mile-long Owens Valley of California.  The valley runs North and South.  The typical prevailing wind is from South to North.  We would normally launch early in the morning to catch smooth air and to give ourselves sufficient time to navigate as far up the valley as possible while there was still daylight and safe flying conditions.

One of the first things you learn in hang gliding, especially when going cross country, is how to read a weather and wind forecast and how to detect changes in conditions while en route.  On a cool summer morning I launched from Walt’s Point at 9,000 feet with my friend Mike, knowing that the wind was blowing North to South, and that if we were going to get in a flight that day we would have to fly south down the valley and just have fun without expecting any personal distance records.

All was going well.  Instead of going for distance, we played around taking pictures as we slowly inched our way south until, off in the distance, we saw dust being blown up along the ground, coming at us from the South – a clear sign that a cold front was moving in with what was likely strong and turbulent wind.

From day one of my hang gliding training, my instructor Joe Greblo drilled into me that when in doubt, you land and sit it out.  In fact, his philosophy was that you never change more than one piece of equipment, or alter more than one part of your launch and flight plan, at a time: if there is a problem, you need to be able to focus all your attention on the one thing you changed instead of dealing with the additional layers of complexity introduced by making several changes at once (which turned out to be good advice when it comes to making changes to anything related to IT and computing).

In my head, my flight plan was to fly south as far as I could and then land safely.  When I saw the dust on the ground I knew there would be additional turbulence at altitude, and sure enough I was able to find a safe landing site and secure my glider before all heck broke loose.  Over our shortwave radios, Mike and I had discussed finding a safe spot to land before the storm hit, but I ignored the radio during my landing, giving the task of landing in the open desert the full attention it deserved.

As the dust storm passed I hit the microphone to find out where Mike was, only to catch him yelling at the top of his lungs that he was riding the front North, headed for what was a great flight for the day up the length of the valley in record time.

When I think back over the transformative “fronts” that have moved through the data processing world over the last 20 years, I am reminded of the times I turned and went with them (data warehousing, cloud computing, big data) or sat them out (search, Internet advertising).  Is the in-memory database one of those times when we need to go with it?  What other transformations might I be missing that will be obvious in hindsight?


Broken Stacks

January 17, 2013

I actively traded futures and equities before the markets became erratic.  During that time I developed hundreds of models and tools for portfolio management.  I did all my work on a desktop devoted solely to trading, and it sat in storage for the last couple of years while I got involved in other activities.  Last week I went to a conference where I met a company that specializes in risk management, and I got the bug to dust off my homegrown portfolio manager, so I fired up the desktop I hadn’t touched in two years to run through it again.  And it was BROKEN.

How could that happen, when none of the code had ever changed?  The system has a web interface so I could check it while mobile, built on a Windows/Apache/MySQL/PHP stack.  All the services were running, and I went through all the logs trying to figure out why PHP couldn’t find MySQL.  I spent hours on the process of elimination.  What had changed?

After ranting for a while about how brittle computing stacks are, it occurred to me that the only thing that could have changed was an antivirus program on the machine that had probably called home for two years of updates.  Sure enough, when I shut the antivirus program down, everything worked just as it had two years earlier.
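
For what it's worth, the whole hunt boils down to one isolation test: can anything on the machine open a TCP connection to MySQL's default port? Here is a minimal sketch using only the Python standard library (the host and port are the usual defaults); if the connection succeeds only when the antivirus is disabled, the culprit is found.

    import socket

    def mysql_port_open(host: str = "127.0.0.1", port: int = 3306) -> bool:
        # Try a plain TCP connection; a local firewall or antivirus that blocks
        # the port will cause this to fail even though mysqld is running.
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            return False

    print("MySQL port reachable:", mysql_port_open())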

In an ideal world, the stack would have done the diagnostics for me like some monolithic brain, providing output in clear English and either healing itself or pointing out the likely culprit(s).  The stack approach to computing has served us well, allowing technology layers to be swapped, replaced, and upgraded as needed without having to replace everything from application through database to operating system when one component gets an upgrade.  But it introduces overhead in communication between the layers, multiple points of failure, and complexity in isolating problems.

Why do I mention it?   Because I think the move to in-memory computing is a great step in eliminating complexity and a big point of failure in the hardware stack.   Cutting out all the overhead in programs that have to talk to the hard drives and maintain the data on the hard drives is a huge boost to application performance and data center productivity.  It also means eliminating the hard drives – the data center component most prone to mechanical failure.

I’m sure there are people working on integrated software development and deployment environments that will do for the software stack what eliminating hard drives does for the hardware stack.  I’m not sure what great leaps they will make, but I’m looking forward to it.


Practical Master Data Management

December 14, 2012

We spend a lot of time talking with customers about master data management (MDM).  Here are a few bullets to convey the meat of the conversation.

1. Why MDM?

You didn’t need MDM when you had one application and one database.  The nature of things like customer, product, and supplier was understood by everyone who used the application and the data.  As soon as you built or acquired another application, with its own database, that referred to the same things, you introduced the potential for a mismatch in the descriptions (attributes) assigned to those things.

For example, the billing system stores the customer address as a Post Office Box, while the shipping system stores the physical street address as the shipping address.  Which address should a new marketing campaign use?  If I want to do market segmentation based on zip code, and the zip codes are different, which one should I use?

This example is overly simplified.  The problem is usually more like a customer with 10 source systems, each describing 10 kinds of things with 50 attributes apiece, and no automated way of applying the rules that decide the “right” attributes to use for 100 million occurrences of those “things.”
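
Here is a minimal sketch of what one of those automated rules can look like, using two hypothetical source records for the same customer; the rule itself ("use the physical address for anything tied to location") is just an example of the kind of decision someone in the business has to own.

    billing = {"customer_id": "C42", "address": "PO Box 910, Reno, NV 89501"}
    shipping = {"customer_id": "C42", "address": "114 Pine St, Reno, NV 89509"}

    def best_address(purpose: str) -> str:
        # Survivorship rule: mailings can go to the PO Box, but anything based
        # on physical location (zip-code segmentation, delivery) uses the
        # shipping system's street address.
        return billing["address"] if purpose == "direct_mail" else shipping["address"]

    print(best_address("direct_mail"))
    print(best_address("zip_segmentation"))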

2. Master Data And Reference Data Are Different

Some customers get confused about the difference between master data and reference data.  Reference data is something like the table of two-character state postal codes we use in the USA: it rarely changes and is used by all your applications for consistency.  As we will see, master data is the result of combining two or more sources to get the “best” combined representation.
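
To put the distinction in code, a rough sketch: the state-code table is reference data that everything simply looks up, while the master record is assembled from two made-up source systems by picking the "best" value for each attribute.

    STATE_CODES = {"CA": "California", "NV": "Nevada"}           # reference data: shared lookup

    billing_src = {"name": "ACME CORP", "state": "CA", "phone": None}
    crm_src     = {"name": "Acme Corporation", "state": "CA", "phone": "415-555-0100"}

    # Master data: the "best" combined representation of the same customer.
    master = {
        "name":  crm_src["name"],                                # prefer the CRM spelling
        "state": STATE_CODES[crm_src["state"]],                  # expand via reference data
        "phone": crm_src["phone"] or billing_src["phone"],       # first non-empty value wins
    }
    print(master)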


3. Master Data Is Operational

In the day-to-day operation of the business, master data is about reconciling the differences between two or more systems and having a single version that represents what someone in the company has defined as the “truth” (at least according to the way they use the data).  You wouldn’t need Master Data Management (MDM) if you changed your operations so that everyone used the same source data, entered all new data related to the same entity the same way, and all data imported from outside sources conformed to your data quality and business rules.  In real life, few organizations are prepared to go back and change their operations to meet these criteria, so you are left with Master Data Management to provide a consistent view of the items you master.

4. Data Quality Precedes MDM

The process of matching disparate sources of data to identify master entities is enhanced significantly when the source data is cleansed and standardized prior to applying matching rules.   Strike that – cleaning up and standardizing the data is required before attempting to do MDM.
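
A minimal sketch of why that cleanup has to come first: the two invented source names below refer to the same company, but they only match after both are run through the same standardization rules.

    import re

    ABBREVIATIONS = {"corporation": "corp", "incorporated": "inc", "limited": "ltd"}

    def standardize(name: str) -> str:
        # Lowercase, strip punctuation, and normalize common suffixes so that
        # matching compares like with like.
        name = re.sub(r"[^\w\s]", "", name.lower()).strip()
        return " ".join(ABBREVIATIONS.get(token, token) for token in name.split())

    a, b = "Acme Corporation", "ACME Corp."
    print(a == b)                                  # False: raw values never match
    print(standardize(a) == standardize(b))        # True: cleansed values do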


5. MDM Can Be A Step In A Process

Sometimes the output of an MDM process is used to create a static master file as input to other systems, or as input to a data warehouse or data mart.  In the latter case, we sometimes do some enhancement as well; for example, combining attributes from multiple systems into a “wide” master record that carries all the attributes from the source systems is common for later analysis purposes.  Just to be clear, this augmentation is not MDM; it is enabled by MDM.  MDM is about identifying an entity and clustering the source records that pertain to that entity.


6. Operational MDM Can Be Performed Real Time In A Hub

An MDM hub is a server that provides MDM services on request.  The same rules developed for data quality, entity identification, and the creation of a “master” record can be executed on demand, so that the resulting master record reflects the actual current state of the source systems.  The hub can be used to do the following (a request sketch follows the list):

  • verify that an item already exists in one or more systems (potentially eliminating the entry of duplicates)
  • standardize the application of data quality processes
  • share operational “rules” that span systems and processes (if billing and shipping addresses are different, then verify shipping address before committing the order)
  • provide an administrative interface for human intervention and workflow for exception handling
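
Here is a minimal sketch of what a "check before you create" request to such a hub might look like. The endpoint, payload, and match-score threshold are all hypothetical; a real hub product would expose its own API for search, standardization, and exception workflow.

    import json
    from urllib.request import Request, urlopen

    candidate = {"name": "Acme Corp", "address": "114 Pine St, Reno, NV"}
    request = Request(
        "https://mdm-hub.example.com/api/match",          # hypothetical hub endpoint
        data=json.dumps(candidate).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        matches = json.load(response)

    if any(m["score"] > 0.9 for m in matches):
        print("Likely duplicate; route to a data steward instead of creating a new record.")
    else:
        print("No strong match; safe to create the record.")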


MDM processes and hubs need to be customized for every client situation.  It is an effort that has to involve the entire enterprise.

7. MDM As A Service

In the ideal future state, data quality enforcement and MDM could be standardized and provided as a service to all applications in the enterprise.  Centralizing these functions, instead of performing them in different ways in each application, could significantly reduce the amount of work spent reconciling differences in the data.  Delivery via an enterprise cloud, or eventually as shared services in a public cloud, is a real possibility.

SAP Announces Predictive Analytics Software

November 28, 2012

I just noticed that yesterday SAP announced the general availability of SAP Predictive Analysis software based on R and HANA.   I’ve been racking my brain trying to figure out why SAS or SPSS wasn’t all over HANA given the incredible potential for real “real-time” analytics.  Now I know why.

http://www.sap.com/news-reader/index.epx?articleID=19981

I don’t know how robust the offering from SAP is at this point, but it can only get better as more and more analytical application developers adapt their offerings to R.

It remains to be seen how effective the SAP sales force will be at “selling” analytics, especially since R is open source.  I expect the draw will be more the other way, with R developers interested in HANA as the underlying data store for their applications.
