I’m excited to have joined BitYota as an evangelist for our data warehouse as a service after spending the last 20 years helping customers implement on-premise data warehouse architectures. The time is right for the cloud data warehouse – the technology has matured, networks have matured, and the mindset of the enterprise data architect is turning to the cloud.
Inside BitYota we hear a lot about the principles the founders wanted to establish for the service – analytics over multiple sources of data, rapid agile delivery of results, and built for the cloud. It’s where the market for data warehousing is headed long term.
Late last year we announced a partnership with Microsoft to deliver BitYota on the Azure cloud. A day doesn’t go by when someone doesn’t ask me how we fit in that ecosystem. I’m going to lay out my answer using a diagram provided by Microsoft here: http://www.microsoft.com/en-us/server-cloud/ (source: Chappell and Associates)
The four quadrants in this diagram categorize the current database offerings on Azure: Operational SQL Technologies, Operational NoSQL Technologies, Analytic SQL Technologies, and Analytic NoSQL Technologies. We define Operational Data as transactional, with lots of inserts, updates and deletes. Analytic Data is typically query-driven, with lots of search and analytics.
For our purposes, we will be looking more closely at the Analytic data column. Here, we can see that the characteristics of the NoSQL quadrant require parallel data loading, fast parallel queries and response times, and the ability to manipulate structured and unstructured data.
A Closer Look at the Analytic Data Offerings
The SQL quadrant requires a SQL interface to access the data, stores data in rows and columns, and enables data cubes for ease of end user reporting.
Now imagine an Azure data offering that loads data in parallel, performs massively parallel queries with response times in seconds, using your existing SQL tools and scripts, over relational data and JSON and XML. You’ve just imagined BitYota on Azure!
MPP Data Warehouse Service on Azure
Of course there is much more to the service than what is shown on this diagram, but I hope it helps to communicate where BitYota fits in the Azure data ecosystem today. This is just the beginning for BitYota as the service evolves to enhance the customer experience doing discovery and analysis of data using Azure.
When I was in college I studied data structures with Dr. Mary Loomis. We were learning how to program data structures of all kinds in memory. At the time, most programs were written with a basic read-a-record / process-a-record / write-a-record sequence. The same programming sequence would be repeated thousands of times until an end-of-file occurred. The emphasis for us as programmers was to make the programs as efficient as possible in terms of CPU time and CPU space.
Dr. Loomis taught us the value of separating the structure and the management of data from the actual program logic as a way of creating even greater efficiency. This approach launched wildly successful companies like Ingres, Sybase, and Oracle.
Did we lose something in the process of making data a subsystem that serves program logic?
One place to look for an answer to that question might be in the “analytics” space, where data scientists are building models around “big” data. They often spend significant time acquiring and formatting data in a way that fits their specific needs for the models they are testing. For them, the program “logic” for their operating models is actually in the data itself. They are either:
- Exploring the data to find out what is happening in the real world (what the program is)
- Creating models in their heads that are then tested and refined against the “program” the data contains
- Adapting models that have worked on other data sets to see if there is a fit.
In this case, shouldn’t the “structure” in the data be telling us what the “program” should be? Are we cloaking the “intelligence” and “programs” in the data by forcing it into database models that are prescribed for the efficiencies we needed when CPUs were relatively slow and had very limited memory?
Why It Matters
The new world of in-memory databases can significantly alter the way we think about data as the source for program logic, as opposed to seeing it as only feedstock for pre-defined program logic.
One of the problems with analytics models is that they can be difficult to deploy into a production environment. They are often defined with statistical programs that are isolated from production environments, with no formal process for implementing the results in production short of rewriting production systems in some other programming environment, with the associated big budgets and long lead times. New in-memory data structures that can support both operational and analytical needs mean that in the future it will be possible to build analytic models using the same data and infrastructure as operations. This means that robust modeling languages with the capability of defining and executing program logic can be placed in “production” by flipping a switch from test to live. Your data scientists and business managers could be working in real time to monitor, model, and change your operations to adapt to economic and social changes as they are occurring.
What’s even more interesting is what happens if we follow the line of thinking that the program is in the data. In other words, what if our models learn and adapt based on the state of all the data and changes in the data?
Today, it is possible for a data scientist to create a decision tree (all the if-then-else logic) that explains something we see in the real world. For example, this can be very useful for explaining and predicting customer behavior based on historical data. This week a major airline lost its entire traffic management system for two hours. There was no historical basis for modeling the impact this had on other airlines, car rental, and lodging companies. What if their models were, instead, learning models that could have adapted within hours or even minutes to the flow of new customer behaviors, creating new capacity and pricing models to leverage the “intelligence” in the data?
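To make the "if-then-else logic" concrete, here is a minimal sketch of a hand-built decision tree. The function name, the customer fields, and the two-hour threshold are all hypothetical and invented for illustration; they are not drawn from any real airline model.

```python
def predict_rebooking(customer):
    """Static decision tree: the rules are fixed when the model is written.

    Every field name and threshold here is a made-up illustration.
    """
    if customer["delay_hours"] >= 2:
        if customer["is_frequent_flyer"]:
            return "rebook_same_airline"
        return "switch_airline"
    return "wait"


# Two hypothetical customers caught in the same outage.
loyal = {"delay_hours": 3, "is_frequent_flyer": True}
casual = {"delay_hours": 3, "is_frequent_flyer": False}

print(predict_rebooking(loyal))
print(predict_rebooking(casual))
```

The point of the sketch is its limitation: the thresholds were chosen from history, so nothing in the tree changes when customers start behaving in ways the history never showed.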
Looking for the program logic in the data rather than imposing that logic on the data has potential for changing the role of technology across industries and it will be exciting to see how it unfolds.
There is obviously a lot of talk about Big Data – data with relatively high Volume, Velocity and Variety. In health care management, the need to handle big data is acute and exacerbated by the Veracity of data – the amount of historical truth about patients and procedures that must be retained over time.
In this article, Charles Boicey explains how UC Irvine Medical Center is using Hadoop as an “adjunctive environment” to their existing enterprise data warehouse. The goal was to have a near real-time analytics data store instead of waiting for 24 hours for the extract-transform-load processes that had to take place before they could access their enterprise data warehouse.
What is interesting is the variety of data they want to access at one time – everything from nurses notes from the electronic medical records to lab results coming in from multiple internal and external sources (HL7). Traditional information architects would have to put a lot of thought into how to model the data to get it into a traditional data warehouse using tables and SQL – especially to get optimal performance for load times, retrieval, sharing, merging, mastering, and query efficiency and effectiveness.
I don’t think this story by itself is unique – there are lots of interesting use cases for Hadoop. What really caught my attention was a comment by one of the architects that the primary reason for the evolution of their MongoDB/Hadoop data store strategy was to avoid the need for data modeling. I would suspect it was also much easier not to deal with all the process involved in extract/transform/load logic, security, and metadata management. Does this mean the traditional IT approach was a hindrance to the business need? Was there some kind of thought about canonical models and user access security that benefited from the collective experience in IT data management?
I think what it says is that Information Management professionals have to embrace the “self service” capabilities for analysis that are now available to business users, and work with them to help them get the business value they need while also helping them to understand the risks in exposing some of these great data stores to lots of potentially less sophisticated users. At a minimum, everyone stands to gain from a security and data governance strategy focused on how to accommodate new models for information delivery rather than stifling innovation.
Can we adapt 20 years of information management process to the new paradigm without spoiling all the cool stuff? I think so, and especially look forward to solving lots of interesting business problems we couldn’t touch in the past.
I used to fly a hang glider cross country in the 100 mile long Owens Valley of California. The valley runs North and South. The typical prevailing wind is from South to North. We would normally launch early in the morning to catch smooth air and to give ourselves sufficient time to navigate as far up the valley as possible while there was still daylight and safe flying conditions.
One of the first things you learn in hang gliding, especially when going cross country, is how to read a weather and wind forecast, and how to detect changes in the conditions while en route. On a cool summer morning I launched from Walt’s Point at 9,000 feet with my friend Mike, knowing that the wind was blowing North to South and that if we were going to get in a flight that day we would have to fly south down the valley and just have fun without expecting any personal distance records.
All was going well and instead of going for distance we played around taking pictures as we slowly inched our way South until, off in the distance, we saw dust being blown up along the ground coming at us from the South – a clear sign that a cold front was moving in with what was likely strong and turbulent wind.
From day one of my hang gliding training, my instructor Joe Greblo drilled into me that when in doubt, land and sit it out. In fact, his philosophy was that you never change more than one piece of equipment or alter more than one part of your launch and flight schedule at a time, because if there was a problem you needed to be able to focus all your attention on fixing the one thing you changed instead of dealing with the additional layers of complexity that are introduced when you make more than one change at a time (which turned out to be good advice when it comes to making changes to things related to IT and computing).
In my head, my flight plan was to fly south as far as I could and then land safely. When I saw the dust on the ground I knew there would be additional turbulence at altitude, and sure enough I was able to find a safe landing site and secure my glider before all heck cut loose. Mike and I had discussed finding a safe spot to land before the storm hit via short wave radios, but I ignored the radio during my landing, giving the task of landing in the open desert the full attention it deserved.
As the dust storm passed I hit the microphone to find out where Mike was, only to catch him yelling at the top of his lungs that he was riding the front North, headed for what was a great flight for the day up the length of the valley in record time.
When I think back over the transformative “fronts” that moved through the data processing world over the last 20 years I am reminded of times when I turned and went with it (data warehouse, cloud computing, big data) or sat it out (search, Internet advertising). Is the in-memory database one of those times when we need to go with it? What other transformations might I be missing that will be obvious in hindsight?
I actively traded futures and equities before the markets became erratic. During that time I developed hundreds of models and tools for portfolio management. I did all my work on a desktop that was devoted solely to trading, and had it in storage the last couple years as I got involved in other activities. Last week I went to a conference where I met a company that specializes in risk management and I got the bug to dust off my homegrown portfolio manager, so I fired up the desktop I hadn’t touched in 2 years to run through it again. And it was BROKEN.
How could that happen, since none of the code was ever changed? It has a web interface so I could check it when I was mobile. I was using a Windows/Apache/MySQL/PHP stack. All the services were running, and I went through all the logs trying to figure out why PHP couldn’t find MySQL. I spent hours going through the process of elimination. What had changed?
After ranting for a while about how brittle computing stacks are, it occurred to me the only thing that could have changed is an antivirus program I had on the machine that probably called home for two years of updates. Sure enough, when I shut the antivirus program down everything worked just as it had two years earlier.
In the ideal world, the stack would have done the diagnostics for me like some monolithic brain providing output in clear English and either healing itself, or pointing out what the likely culprit(s) might be. The stack approach to computing has served well, allowing technology layers to be swapped, replaced, and upgraded as needed without having to replace everything from application through database to operating system when one of the components gets an upgrade. But – it introduces overhead in terms of communication between the layers, multiple points of failure, and complexity in isolating problems.
Why do I mention it? Because I think the move to in-memory computing is a great step in eliminating complexity and a big point of failure in the hardware stack. Cutting out all the overhead in programs that have to talk to the hard drives and maintain the data on the hard drives is a huge boost to application performance and data center productivity. It also means eliminating the hard drives – the data center component most prone to mechanical failure.
I’m sure there are people working on integrated software development and deployment environments that will be akin to eliminating hard drives. I’m not sure what great leaps they will make, but I’m looking forward to it.
We spend a lot of time talking with customers about master data management (MDM). Here are a few bullets to convey the meat of the conversation.
1. Why MDM?
You didn’t need MDM when you had one application and one database. The nature of things like customer, product, and supplier were understood by everyone who used the application and the data. As soon as you built or acquired another application with its own database that referred to the same things, you introduced the potential for a mismatch in the description (attributes) attributed to those things.
For example, the billing system has customer address as a Post Office Box and the shipping system has the physical customer address as the shipping address. What address should a new marketing campaign use? If I want to do market segmentation based on zip code, and the zip codes are different, which one should I use?
This example is overly simplified. The problem is usually more along the lines of the customer having 10 source systems with 10 things each having 50 attributes and no automated way of applying the rules to make the decisions about the “right” attributes to use for 100 million occurrences of those “things.”
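As a rough sketch of what "applying the rules" can look like for the address example, the snippet below picks each attribute from the highest-priority source that has a value. The records, the field names, and the choice to prefer the shipping system are all assumptions made for illustration; in practice the business decides which source wins for each attribute.

```python
# Hypothetical records for the same customer in two source systems.
billing = {"source": "billing", "address": "PO Box 12", "zip": "93514"}
shipping = {"source": "shipping", "address": "450 Main St", "zip": "93545"}

# Assumed rules: for each attribute, the sources in order of preference.
SOURCE_PRIORITY = {
    "address": ["shipping", "billing"],
    "zip": ["shipping", "billing"],
}


def pick_attribute(attribute, records):
    """Return the value from the highest-priority source that has one."""
    by_source = {record["source"]: record for record in records}
    for source in SOURCE_PRIORITY[attribute]:
        value = by_source.get(source, {}).get(attribute)
        if value:
            return value
    return None


def build_master(records):
    """Apply the priority rules to every mastered attribute."""
    return {attr: pick_attribute(attr, records) for attr in SOURCE_PRIORITY}


master = build_master([billing, shipping])
print(master)
```

With the rules written down as data, the same decision can be applied automatically to every occurrence instead of being argued one campaign at a time.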
2. Master Data And Reference Data Are Different
Some customers get confused about the difference between master data and reference data. Reference data is something like the table of two-character state postal codes we use in the USA. These codes rarely change and are used by all your applications for consistency. As we will see, master data is the result of combining two or more sources to get the “best” combined representation.
3. Master Data Is Operational
In the day-to-day operation of the business, Master Data is about reconciling the difference between two or more systems and having a single version that represents what someone in the company has defined as the “truth” (at least according to the way they use the data). You wouldn’t need Master Data Management (MDM) if you changed your operations so that everyone was using the same source data, entering all new data related to the same entity the same way, and all data imported from outside sources conformed to your data quality and business rules. In real life, few organizations are prepared to go back and change their operations to meet these criteria, so you are resigned to Master Data Management to address the need for a consistent view of the items you master.
4. Data Quality Precedes MDM
The process of matching disparate sources of data to identify master entities is enhanced significantly when the source data is cleansed and standardized prior to applying matching rules. Strike that – cleaning up and standardizing the data is required before attempting to do MDM.
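A small sketch shows why. The two records below describe the same company, but a matching rule only sees that once both have been standardized. The abbreviation table, the field names, and the exact-match rule are all simplifying assumptions; real cleansing and matching are far more involved.

```python
import re

# Hypothetical abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "incorporated": "inc"}


def standardize(text):
    """Lowercase, drop punctuation, collapse spaces, normalize abbreviations."""
    words = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def is_match(a, b):
    """Assumed rule: exact match on standardized name and address."""
    key_a = (standardize(a["name"]), standardize(a["address"]))
    key_b = (standardize(b["name"]), standardize(b["address"]))
    return key_a == key_b


crm = {"name": "Acme, Incorporated", "address": "450 Main Street"}
erp = {"name": "ACME INC.", "address": "450 Main St."}

# Comparing the raw values misses the match; standardizing first finds it.
raw_match = (crm["name"], crm["address"]) == (erp["name"], erp["address"])
clean_match = is_match(crm, erp)
print(raw_match, clean_match)
```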
5. MDM Can Be A Step In A Process
Sometimes the output of an MDM process is used as a way to create a static Master File for input to other systems, or as input to a data warehouse or data mart. In the latter case, we sometimes do some enhancement as well. For example, combining attributes from multiple systems into a “wide” master record that has all the attributes from the source systems would be common for later analysis purposes. Just to be clear, this augmentation is not MDM, it is enabled by MDM. MDM is about the identification of an entity and clustering the source records that pertain to that entity.
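The two steps can be sketched separately: first cluster the source records by entity (the MDM part), then build the "wide" record from each cluster (the enhancement MDM enables). In this sketch the `match_key` field is a hypothetical stand-in for the output of real matching rules, and all records are invented.

```python
from collections import defaultdict

# Hypothetical, already-cleansed source records.
source_records = [
    {"system": "billing", "match_key": "acme inc|93545",
     "billing_address": "PO Box 12"},
    {"system": "shipping", "match_key": "acme inc|93545",
     "shipping_address": "450 Main St"},
    {"system": "billing", "match_key": "zenith llc|90210",
     "billing_address": "9 Elm Ave"},
]


def cluster_by_entity(records):
    """MDM step: group the source records that pertain to the same entity."""
    clusters = defaultdict(list)
    for record in records:
        clusters[record["match_key"]].append(record)
    return dict(clusters)


def widen(cluster):
    """Enhancement step: merge every source attribute into one wide record."""
    wide = {}
    for record in cluster:
        for name, value in record.items():
            if name not in ("system", "match_key"):
                wide[name] = value
    return wide


clusters = cluster_by_entity(source_records)
wide_records = {key: widen(members) for key, members in clusters.items()}
print(wide_records["acme inc|93545"])
```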
6. Operational MDM Can Be Performed Real Time In A Hub
An MDM Hub is a server that provides MDM services on request. The same rules developed for data quality, entity identification, and the creation of a “master” record can be performed upon request by a server so that the resulting master record reflects the actual current state of the source systems. The hub can be used to:
- verify that an item already exists in one or more systems (potentially eliminating the entry of duplicates)
- standardize the application of data quality processes
- share operational “rules” that span systems and processes (if billing and shipping addresses are different, then verify shipping address before committing the order)
- provide an administrative interface for human intervention and workflow for exception handling
MDM processes and hubs need to be customized for every client situation. It is an effort that has to involve the entire enterprise.
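As a toy illustration of two of the hub uses listed above (the duplicate check and the shared billing/shipping rule), here is an in-process stand-in for a hub. The class name, method names, and record shapes are all hypothetical; a real hub would be a network service with far richer matching.

```python
class MdmHub:
    """Toy stand-in for an MDM hub that answers requests on demand."""

    def __init__(self, source_systems):
        # Assumed shape: system name -> list of record dicts with a "name".
        self.source_systems = source_systems

    def find_existing(self, name):
        """List the systems that already hold a record with this name."""
        key = name.strip().lower()
        return [
            system
            for system, records in self.source_systems.items()
            if any(r["name"].strip().lower() == key for r in records)
        ]

    def needs_shipping_verification(self, order):
        """Shared rule: differing billing and shipping addresses must be
        verified before the order is committed."""
        return order["billing_address"] != order["shipping_address"]


hub = MdmHub({
    "billing": [{"name": "Acme Inc"}],
    "shipping": [{"name": "acme inc "}, {"name": "Zenith LLC"}],
})

print(hub.find_existing(" ACME INC"))
print(hub.needs_shipping_verification(
    {"billing_address": "PO Box 12", "shipping_address": "450 Main St"}))
```

Because every application asks the same hub, the duplicate check and the address rule are defined once rather than reimplemented in each system.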
7. MDM As A Service
In the ideal future state, data quality enforcement and MDM could be standardized and provided as a service to all applications in the enterprise. Centralizing these functions instead of performing them in different ways in each application could significantly reduce the amount of work that has to take place reconciling differences in the data. Delivery via an enterprise cloud, or eventually via shared services in a public cloud, is a possibility.