Archive

Posts Tagged ‘in-memory database’

Metamarkets, A Practical Application For In-Memory Data Management

February 21, 2013

I had the opportunity to talk with representatives of Metamarkets this week about their use of an in-memory data store and found their argument for using in-memory data storage for real-time (or near real-time) analysis compelling.

Metamarkets is a San Francisco company that provides real-time presentation and analysis of event data.  Their first customers are interactive digital marketing companies like the Financial Times, who are looking for real-time feedback on the performance of advertising placements.  (They hinted that next week they will announce a very large customer using their technology to analyze on-demand video activity.)  Their solution is offered as a service running on Amazon Web Services and on-premise.  Their website posts the following stats:

  • 300+ billion events ingested and processed per month
  • 100,000+ ad hoc, multi-dimensional queries executed per day
  • 10+ TB of compressed, memory-mapped derived data
  • 500ms average query response time

The Metamarkets data stack has interesting parallels to what SAS is doing with its Visual Analytics offering, with essentially four layers of functionality.  Here is how the two look side by side:

                      SAS Visual Analytics           Metamarkets
Target Audience       Business Users (self service)  Business Users (self service)
Visual Presentation   Flash                          JavaScript (proprietary scripts)
Analytics             SAS                            R* (proprietary algorithms)
In-Memory Data Store  SAS LASR Server                Druid* Columnar Store
Staging / ETL         Greenplum, Teradata, HDFS*     Hadoop*

*Open Source

One of the things I find most interesting is how much Hadoop (or HDFS) has become the “store and forward” method for capturing event data for subsequent processing for these vendors and possibly others pitching the equivalent of the “analytic data warehouse.” 

I also think there is some debate about how “real time” the analysis is for Metamarkets, given the latency of a Hadoop ETL layer.

Metamarkets developed Druid internally and has released it as an open source project.  (They have a respectable following on GitHub, with about half as many followers as Impala from Cloudera and twice as many as VoltDB as of the date of this post.)  Time will tell if they gave away proprietary technology, or if they were smart to outsource development of what will become commodity-like technology and focus on the real intellectual property – the R language algorithms used to make sense of the data for non-data scientists.  I think the latter is most likely.

Their business model and data architecture are oriented towards time-series data, but I didn’t see anything in their architecture that would limit them to time-series data in the future.
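To make the time-series orientation concrete, here is a minimal sketch of what an aggregation query against a Druid broker could look like using Druid's native JSON query API; the datasource name, field names, and broker URL below are hypothetical, and the details may vary across Druid versions.

```python
# Minimal sketch of a Druid "timeseries" query posted to a broker node.
# The datasource, field names, and broker URL are hypothetical examples.
import json
import urllib.request

query = {
    "queryType": "timeseries",
    "dataSource": "ad_events",                  # hypothetical datasource
    "granularity": "hour",
    "intervals": ["2013-02-20/2013-02-21"],
    "aggregations": [
        {"type": "count", "name": "events"},
        {"type": "doubleSum", "fieldName": "revenue", "name": "total_revenue"},
    ],
}

req = urllib.request.Request(
    "http://broker.example.com:8082/druid/v2",  # hypothetical broker endpoint
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for row in json.load(resp):
        print(row["timestamp"], row["result"])
```

The query is expressed over time intervals and granularities, which matches the shape of the event data they ingest today, but nothing in the pattern permanently ties the dimensions to time.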

I think it is amazing what is being accomplished today with open source software.  This is a rich time to be in the analytics business and I look forward to some of the amazing insights to come from the availability of data and modeling capabilities previously available only to well-funded data scientists.

See more about Metamarkets here at DBMS2, and Druid here at Metamarkets.com.

Winds Change

January 25, 2013

I used to fly a hang glider cross country in the 100-mile-long Owens Valley of California.  The valley runs North and South.  The typical prevailing wind is from South to North.  We would normally launch early in the morning to catch smooth air and to give ourselves sufficient time to navigate as far up the valley as possible while there was still daylight and safe flying conditions.

One of the first things you learn in hang gliding, especially when going cross country, is how to read a weather and wind forecast, and how to detect changes in the conditions while en route.  On a cool summer morning I launched from Walt's Point at 9,000 feet with my friend Mike, knowing that the wind was blowing North to South, but that if we were going to get in a flight that day we would have to fly south down the valley and just have fun without expecting any personal distance records.

All was going well, and instead of going for distance we played around taking pictures as we slowly inched our way South until, off in the distance, we saw dust being blown up along the ground coming at us from the South – a clear sign that a cold front was moving in with what was likely strong and turbulent wind.

From day one of my hang gliding training, my instructor Joe Greblo drilled into me that when in doubt, land and sit it out.  In fact, his philosophy was that you never change more than one piece of equipment or alter more than one part of your launch and flight routine at a time, because if there was a problem you needed to be able to focus all your attention on the one thing you changed instead of dealing with the additional layers of complexity introduced by making several changes at once (which turned out to be good advice when it comes to making changes to anything related to IT and computing).

In my head, my flight plan was to fly south as far as I could and then land safely.  When I saw the dust on the ground I knew there would be additional turbulence at altitude, and sure enough I was able to find a safe landing site and secure my glider before all heck broke loose.  Mike and I had discussed over our shortwave radios finding a safe spot to land before the storm hit, but I ignored the radio during my landing, giving the task of landing in the open desert the full attention it deserved.

As the dust storm passed I hit the microphone to find out where Mike was, only to catch him yelling at the top of his lungs that he was riding the front North, headed for what turned out to be the great flight of the day, up the length of the valley in record time.

When I think back over the transformative “fronts” that moved through the data processing world over the last 20 years, I am reminded of times when I turned and went with them (data warehousing, cloud computing, big data) or sat them out (search, Internet advertising).  Is the in-memory database one of those fronts we need to go with?  What other transformations might I be missing that will be obvious in hindsight?


A LIST For SAP HANA

November 8, 2012

SAP HANA seems to be getting some traction.  I happen to believe in the future of in-memory databases.  It only makes sense that we should try to balance the mismatch between processor speeds and the availability of data for processing.

I think SAP has some work ahead of it marketing HANA.  Here is a LIST of things I think they need to start doing immediately to boost their chances of making it a generally accepted database in SAP and non-SAP accounts.

Language – SAP needs to own the term in-memory database.  In-memory is being widely used by Oracle and others to refer to lots of things that happen with data, without fitting the requirements for being a truly in-memory database as defined here.  The problem is that “in-memory database” may quickly become so overused and misused that it turns into a meaningless marketing term like “big data”.  SAP needs to aggressively promote a strict definition for the label in-memory database and call out any vendor who misuses it.

Image – It's hard for any database product to say it can do everything – we just don't believe it anymore.  HANA needs to occupy one space in the mind of IT, and I believe that space is speed.  The advantages of data processing speed will apply regardless of where HANA is eventually applied, whether it is transaction processing or decision support, an operational data store or a data warehouse.  Everyone wants more data processed faster.  It is probably the most defensible position for HANA.

On a side note, I keep hearing HANA pronounced as Hanna.  A woman's name probably doesn't evoke the image they are going for.  I would emphasize a pronunciation that rhymes with Katana – the deadly Japanese sword.

Sentiment – The Internet makes it possible to negatively influence hundreds of thousands of potential HANA users with a few keystrokes.  SAP is going to have to actively monitor and immediately respond to misinformation and requests for information about HANA in every blog and competitor release.  Interestingly enough, the name HANA isn't that hard to monitor because it is somewhat unique.

Technology – CIOs and other IT leaders play an unfair game of chess every time they put together their IT strategy.  It's them against IBM, Oracle, Accenture, and every other technology vendor playing its own strategy to take IT's money and time.  And the vendors don't have to follow the rules – they can bypass the CIO and go directly to business executives to change the rules.

SAP is going to have to prove it is on the side of the CIO and is working to advance his or her game.  It has to be proactive in presenting why HANA should be a new piece on the board, and how HANA will provide long-term strategic advantage for both IT and the business.  Speed will get HANA in the door; a viable strategic plan will keep it in the game.

I’m sure there are other points I could bring up, but then I couldn’t use my cute acronym LIST.   Besides, SAP will be busy with just these four for years to come.

In-Memory Databases

October 31, 2012

A couple of years ago I wrote an email to the CEO of Cloudera about why I was so excited about Hadoop and the potential for distributed processing (he never wrote back 😦).  When I studied computer science we talked a lot about how processors could be coordinated to do work, and I saw MapReduce as an interesting real-world application of distributed processing – although in this case the emphasis was on data distribution and retrieval.  I'm still a believer in what Hadoop and its offspring can do for managing and reporting on large volumes of transaction data.
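For anyone who hasn't looked under the hood, here is a toy, single-process sketch of the map/shuffle/reduce pattern applied to word counting; Hadoop's real contribution is running these same phases across many machines on top of a distributed file system.

```python
# Toy, single-process illustration of the map/shuffle/reduce pattern (word count).
# Real MapReduce distributes these phases across a cluster of machines.
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]
pairs = chain.from_iterable(map_phase(doc) for doc in documents)
print(reduce_phase(shuffle_phase(pairs)))  # {'the': 3, 'quick': 2, ...}
```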

I am even more excited about the potential of in-memory data storage.  There is a real mismatch between the speed at which today's CPUs can process data and the speed at which they can access databases that reside on hard disk drives or even the new flash drives.  What happens when that bottleneck goes away?  What happens when the CPU is free to do its data processing 1,000 times faster than it does today?
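For a very rough feel for that gap, here is a small sketch (not a rigorous benchmark) that runs the same point lookups against an on-disk SQLite table and an in-memory copy; the file location and row count are arbitrary, and the measured difference will depend heavily on hardware and operating system caching.

```python
# Rough illustration of in-memory versus disk-backed lookups using SQLite.
# Not a rigorous benchmark; results vary with hardware and OS caching.
import os
import sqlite3
import tempfile
import time

ROWS = 100_000

def build(conn):
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, value REAL)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        ((i, float(i)) for i in range(ROWS)),
    )
    conn.commit()

def time_lookups(conn):
    start = time.perf_counter()
    for i in range(0, ROWS, 7):
        conn.execute("SELECT value FROM events WHERE id = ?", (i,)).fetchone()
    return time.perf_counter() - start

db_path = os.path.join(tempfile.mkdtemp(), "events.db")
disk = sqlite3.connect(db_path)        # table stored on the file system
memory = sqlite3.connect(":memory:")   # table lives entirely in RAM
for conn in (disk, memory):
    build(conn)

print("on disk  :", round(time_lookups(disk), 4), "seconds")
print("in memory:", round(time_lookups(memory), 4), "seconds")
```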

The folks in analytics talk about reducing batch jobs from hours to minutes.   They talk about increasing the complexity of jobs – with more data points in a time series or with more data from related data sources – without increasing the time required to do analysis.  This could be a real advantage for the corporate world.

I wonder what this means for even more interesting things like robotics and artificial intelligence.  Say, for example, that my automated vehicle has all the data from every trip it has made from my home to the grocery store available to match in real time against the conditions it senses (sees) ahead of it.  It “remembers” the blind driveway that my neighbor occasionally backs out of without looking, or the new oil they put down for a street repair last week that gets slick in the morning dew.  Like a computerized chess master playing a board that is always changing, it can quickly run through 1,000 scenarios in near real time, anticipating more moves and creating more contingencies than the typical human driver (especially the ones distracted by their smartphones).

We aren't there yet.  In-memory databases are just now taking hold in the corporate world.  I wonder how soon this technology will trickle down to become affordable on desktops and mobile devices, and what kinds of wondrous capabilities it will enable.