I have to admit, I was a skeptic when I first heard about BitYota. I wasn’t sure how a small startup team could compete with database companies that have hundreds of developers and have had 10 or more years to perfect their software. But I have to give a shout-out to Harmeek Singh Bedi and Paresh Goswami and all the talented people who work with them for pulling it off! I’ve seen them put the software through its paces on Azure and AWS without breaking a sweat – data loads of 30,000 transactions per second on a single node, batch loads of a terabyte from AWS across the Internet to Azure in under an hour, and very complex queries running concurrently in minutes against more than 20 billion rows (comparable to the best performance from a highly tailored on-premises data appliance vendor). Granted, they haven’t had time to perfect all the bells and whistles that the big guys have had years to build out, but at the core level the software performs where it really counts.
I don’t know what the future holds for the company, but this engineering team should be very proud of what they have accomplished!
I’m excited to have joined BitYota as an evangelist for our data warehouse as a service after spending the last 20 years helping customers implement on-premise data warehouse architectures. The time is right for the cloud data warehouse – the technology has matured, networks have matured, and the mindset of the enterprise data architect is turning to the cloud.
Inside BitYota we hear a lot about the principles the founders wanted to establish for the service – analytics over multiple sources of data, rapid agile delivery of results, and built for the cloud. It’s where the market for data warehouse is headed long term.
Late last year we announced a partnership with Microsoft to deliver BitYota on the Azure cloud. Hardly a day goes by without someone asking me how we fit in that ecosystem. I’m going to lay out my answer using a diagram provided by Microsoft here: http://www.microsoft.com/en-us/server-cloud/ (source: Chappell and Associates)
The four quadrants in this diagram categorize the current database offerings on Azure: Operational SQL Technologies, Operational NoSQL Technologies, Analytic SQL Technologies, and Analytic NoSQL Technologies. We define Operational Data as transactional, with lots of inserts, updates and deletes. Analytic Data is typically read through queries, with lots of search and analytics.
For our purposes, we will be looking more closely at the Analytic Data column. Here, we can see that the NoSQL quadrant requires parallel data loading, fast parallel queries and response times, and the ability to manipulate structured and unstructured data.
A Closer Look at the Analytic Data Offerings
The SQL quadrant requires a SQL interface to access the data, stores data in rows and columns, and enables data cubes for ease of end user reporting.
Now imagine an Azure data offering that loads data in parallel, performs massively parallel queries with response times in seconds, using your existing SQL tools and scripts, over relational data and JSON and XML. You’ve just imagined BitYota on Azure!
MPP Data Warehouse Service on Azure
Of course there is much more to the service than what is shown on this diagram, but I hope it helps to communicate where BitYota fits in the Azure data ecosystem today. This is just the beginning for BitYota as the service evolves to enhance the customer experience doing discovery and analysis of data using Azure.
When I was in college I studied data structures with Dr. Mary Loomis. We were learning how to program data structures of all kinds in memory. At the time, most programs were written with a basic read-a-record / process-a-record / write-a-record sequence. The same programming sequence would be repeated thousands of times until an end-of-file occurred. The emphasis for us as programmers was to make the programs as efficient as possible in terms of CPU time and CPU space.
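To make that old pattern concrete, here is a minimal sketch of the read-a-record / process-a-record / write-a-record loop. It is written in Python purely for illustration (not what we used back then), and the names `process_file`, `source`, and `target` are my own, hypothetical choices.

```python
import io

def process_file(infile, outfile, process):
    """Read one record, process it, write it; repeat until end-of-file."""
    count = 0
    while True:
        record = infile.readline()        # read-a-record
        if record == "":                  # empty string signals end-of-file
            break
        result = process(record.rstrip("\n"))  # process-a-record
        outfile.write(result + "\n")           # write-a-record
        count += 1
    return count

# In-memory stand-ins for the input and output files of the day.
source = io.StringIO("alice,10\nbob,20\n")
target = io.StringIO()
n = process_file(source, target, str.upper)
```

Notice that the program itself knows how a record is laid out and where one ends. That coupling between program logic and data layout is the thing the next idea set out to remove.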
Dr. Loomis taught us the value of separating the structure and the management of data from the actual program logic as a way of creating even greater efficiency. This approach launched wildly successful companies like Ingres, Sybase, and Oracle.
Did we lose something in the process of making data a subsystem that serves program logic?
One place to look for an answer to that question might be in the “analytics” space, where data scientists are building models around “big” data. They often spend significant time acquiring and formatting data in a way that fits their specific needs for the models they are testing. For them, the program “logic” for their operating models is actually in the data itself. They are either:
- Exploring the data to find out what is happening in the real world (what the program is)
- Creating models in their heads that are then tested and refined against the “program” the data contains
- Adapting models that have worked on other data sets to see if there is a fit.
In this case, shouldn’t the “structure” in the data be telling us what the “program” should be? Are we cloaking the “intelligence” and “programs” in the data by forcing it into database models that are prescribed for the efficiencies we needed when CPUs were relatively slow and had very limited memory?
Why It Matters
The new world of in-memory databases can significantly alter the way we think about data as the source for program logic, as opposed to seeing it as only feedstock for pre-defined program logic.
One of the problems with analytics models is that they can be difficult to deploy into a production environment. They are often defined with statistical programs that are isolated from production environments, with no formal process for implementing the results in production short of rewriting production systems in some other programming environment, with the associated big budgets and long lead times. New in-memory data structures that can support both operational and analytical needs mean that in the future it will be possible to build analytic models using the same data and infrastructure as operations. This means that robust modeling languages with the capability of defining and executing program logic can be placed in “production” by flipping a switch from test to live. Your data scientists and business managers could be working in real time to monitor, model, and change your operations to adapt to economic and social changes as they are occurring.
What’s even more interesting is what happens if we follow the line of thinking that the program is in the data. In other words, what if our models learn and adapt based on the state of all the data and changes in the data?
Today, it is possible for a data scientist to create a decision tree (all the if-then-else logic) that explains something we see in the real world. For example, this can be very useful for explaining and predicting customer behavior based on historical data. This week a major airline lost its entire traffic management system for two hours. There was no historical basis for modeling the impact this had on other airlines, car rental, and lodging companies. What if their models were, instead, learning models that could have adapted within hours or even minutes to the flow of new customer behaviors, creating new capacity and pricing models to leverage the “intelligence” in the data?
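For readers who have not seen one, a hand-built decision tree really is just nested if-then-else logic. The sketch below is a made-up example in Python: the function name, the customer attributes, and the thresholds are all hypothetical and are not taken from any airline's model.

```python
def likely_to_rebook(trips_last_year, is_loyalty_member, last_flight_delayed):
    """A static decision tree: every branch and threshold is fixed by hand.

    All attributes and cut-offs here are illustrative assumptions.
    """
    if is_loyalty_member:
        # Loyal customers usually come back, unless they rarely fly
        # and their most recent experience was a delay.
        if last_flight_delayed and trips_last_year < 3:
            return False
        return True
    else:
        # Non-members are assumed to rebook only if they fly often.
        if trips_last_year >= 6:
            return True
        return False

frequent_member = likely_to_rebook(5, True, True)
rare_member = likely_to_rebook(1, True, True)
```

The thresholds are frozen at whatever the historical data suggested when the tree was written. Nothing in this code changes when customer behavior shifts overnight, which is the gap a learning model would have to close.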
Looking for the program logic in the data rather than imposing that logic on the data has potential for changing the role of technology across industries and it will be exciting to see how it unfolds.
There is obviously a lot of talk about Big Data – data with relatively high Volume, Velocity and Variety. In health care management, the need to handle big data is acute and exacerbated by the Veracity of data – the amount of historical truth about patients and procedures that must be retained over time.
In this article, Charles Boicey explains how UC Irvine Medical Center is using Hadoop as an “adjunctive environment” to their existing enterprise data warehouse. The goal was to have a near real-time analytics data store instead of waiting for 24 hours for the extract-transform-load processes that had to take place before they could access their enterprise data warehouse.
What is interesting is the variety of data they want to access at one time – everything from nurses’ notes from the electronic medical records to lab results coming in from multiple internal and external sources (HL7). Traditional information architects would have to put a lot of thought into how to model the data to get it into a traditional data warehouse using tables and SQL – especially to get optimal performance for load times, retrieval, sharing, merging, mastering, and query efficiency and effectiveness.
I don’t think this story by itself is unique – there are lots of interesting use cases for Hadoop. What really caught my attention was a comment by one of the architects that the primary reason for the evolution of their MongoDB/Hadoop data store strategy was to avoid the need for data modeling. I would suspect it was also much easier not to deal with all the process involved in extract/transform/load logic, security, and metadata management. Does this mean the traditional IT approach was a hindrance to the business need? Was there some kind of thought about canonical models and user access security that benefited from the collective experience in IT data management?
I think what it says is that Information Management professionals have to embrace the “self service” capabilities for analysis that are now available to business users, and work with them to help them get the business value they need while also helping them to understand the risks in exposing some of these great data stores to lots of potentially less sophisticated users. At a minimum, everyone stands to gain from a security and data governance strategy focused on how to accommodate new models for information delivery rather than stifling innovation.
Can we adapt 20 years of information management process to the new paradigm without spoiling all the cool stuff? I think so, and especially look forward to solving lots of interesting business problems we couldn’t touch in the past.
I used to fly a hang glider cross country in the 100 mile long Owens Valley of California. The valley runs North and South. The typical prevailing wind is from South to North. We would normally launch early in the morning to catch smooth air and to give ourselves sufficient time to navigate as far up the valley as possible while there was still daylight and safe flying conditions.
One of the first things you learn in hang gliding, especially when going cross country, is how to read a weather and wind forecast, and how to detect changes in the conditions while en route. On a cool summer morning I launched from Walt’s Point at 9,000 feet with my friend Mike, knowing that the wind was blowing North to South, and that if we were going to get in a flight that day we would have to fly south down the valley and just have fun without expecting any personal distance records.
All was going well and instead of going for distance we played around taking pictures as we slowly inched our way South until, off in the distance, we saw dust being blown up along the ground coming at us from the South – a clear sign that a cold front was moving in with what was likely strong and turbulent wind.
From day one of my hang gliding training, my instructor Joe Greblo drilled into me: when in doubt, land and sit it out. In fact, his philosophy was that you never change more than one piece of equipment or alter more than one part of your launch and flight schedule at a time, because if there was a problem you needed to be able to focus all your attention on fixing the one thing you changed instead of dealing with the additional layers of complexity introduced by making more than one change at a time (which turned out to be good advice for making changes in IT and computing, too).
In my head, my flight plan was to fly south as far as I could and then land safely. When I saw the dust on the ground I knew there would be additional turbulence at altitude, and sure enough I was able to find a safe landing site and secure my glider before all heck cut loose. Mike and I had discussed finding a safe spot to land before the storm hit via short wave radios, but I ignored the radio during my landing, giving the task of landing in the open desert the full attention it deserved.
As the dust storm passed I hit the microphone to find out where Mike was, only to catch him yelling at the top of his lungs that he was riding the front North, headed for what was a great flight for the day up the length of the valley in record time.
When I think back over the transformative “fronts” that moved through the data processing world over the last 20 years, I am reminded of times when I turned and went with it (data warehouse, cloud computing, big data) or sat it out (search, Internet advertising). Is in-memory database one of those times when we need to go with it? What other transformations might I be missing that will be obvious in hindsight?
I actively traded futures and equities before the markets became erratic. During that time I developed hundreds of models and tools for portfolio management. I did all my work on a desktop that was devoted solely to trading, and had it in storage the last couple years as I got involved in other activities. Last week I went to a conference where I met a company that specializes in risk management and I got the bug to dust off my homegrown portfolio manager, so I fired up the desktop I hadn’t touched in 2 years to run through it again. And it was BROKEN.
How could that happen, since none of the code was ever changed? It has a web interface so I could check it when I was mobile. I was using a Windows/Apache/MySQL/PHP stack. All the services were running, and I went through all the logs trying to figure out why PHP couldn’t find MySQL. I spent hours going through the process of elimination. What had changed?
After ranting for a while about how brittle computing stacks are, it occurred to me that the only thing that could have changed was an antivirus program I had on the machine, which had probably called home for two years of updates. Sure enough, when I shut the antivirus program down everything worked just as it had two years earlier.
In the ideal world, the stack would have done the diagnostics for me like some monolithic brain providing output in clear English and either healing itself, or pointing out what the likely culprit(s) might be. The stack approach to computing has served well, allowing technology layers to be swapped, replaced, and upgraded as needed without having to replace everything from application through database to operating system when one of the components gets an upgrade. But – it introduces overhead in terms of communication between the layers, multiple points of failure, and complexity in isolating problems.
Why do I mention it? Because I think the move to in-memory computing is a great step in eliminating complexity and a big point of failure in the hardware stack. Cutting out all the overhead in programs that have to talk to the hard drives and maintain the data on the hard drives is a huge boost to application performance and data center productivity. It also means eliminating the hard drives – the data center component most prone to mechanical failure.
I’m sure there are people working on integrated software development and deployment environments that will be akin to eliminating hard drives. I’m not sure what great leaps they will make, but I’m looking forward to it.