
Posts Tagged ‘Cloudera’

In-Memory Databases

October 31, 2012
Google Driverless Car

A couple of years ago I wrote an email to the CEO of Cloudera about why I was so excited about Hadoop and the potential for distributed processing (he never wrote back 😦 ).  When I studied computer science we talked a lot about how processors could be coordinated to do work, and I saw MapReduce as an interesting real-world application of distributed processing – although in this case the emphasis was on data distribution and retrieval.  I’m still a believer in what Hadoop and its offspring can do for managing and reporting on large volumes of transaction data.
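To make the MapReduce idea concrete, here is the classic word-count example as a single-machine sketch in Python. Real Hadoop distributes the map, shuffle, and reduce steps across a cluster; this toy version just shows the shape of the computation (the function names and sample documents are mine, not Hadoop’s).

```python
# A minimal, single-machine sketch of MapReduce: "map" emits key/value
# pairs, and "reduce" groups them by key and aggregates each group.
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Group the emitted pairs by word and sum the counts."""
    groups = defaultdict(int)
    for word, count in pairs:
        groups[word] += count
    return dict(groups)

docs = ["Hadoop stores data", "Hadoop processes data"]
counts = reduce_phase(map_phase(docs))
```

On a cluster, many mappers and reducers run this same logic in parallel over different slices of the data, which is where the distributed-processing payoff comes from.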

I am even more excited about the potential for in-memory data storage.  There is a real mismatch between the speed at which today’s CPUs can process data and the speed at which they can access data that resides on hard disk drives, or even on the newer flash drives.  What happens when that bottleneck goes away?  What happens when the CPU is free to do its data processing 1,000 times faster than it does today?
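For a small-scale taste of the idea, SQLite can hold an entire database in RAM. This is nothing like the enterprise in-memory products the analytics folks are deploying, but it illustrates the principle: no disk I/O sits between the CPU and the data (the table and values here are just invented for the example).

```python
# SQLite's ":memory:" mode keeps the whole database in RAM --
# a miniature illustration of the in-memory database idea.
import sqlite3

conn = sqlite3.connect(":memory:")  # nothing is ever written to disk
conn.execute("CREATE TABLE trips (origin TEXT, destination TEXT, minutes REAL)")
conn.execute("INSERT INTO trips VALUES ('home', 'grocery store', 12.5)")
rows = conn.execute("SELECT destination, minutes FROM trips").fetchall()
```

Every query here is served from memory, which is why reads and writes avoid the disk-latency penalty the paragraph above describes.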

The folks in analytics talk about reducing batch jobs from hours to minutes.   They talk about increasing the complexity of jobs – with more data points in a time series or with more data from related data sources – without increasing the time required to do analysis.  This could be a real advantage for the corporate world.

I wonder what this means for even more interesting things like robotics and artificial intelligence.  Say, for example, that my automated vehicle has all the data related to every trip it’s made from my home to the grocery store available to match in real time with the conditions it senses (sees) ahead of it.  It “remembers” the blind driveway that my neighbor occasionally backs out of without looking, or the new oil they put down for a street repair last week that is slick in the morning dew.  Like a computerized chess master playing a board that is always changing, it can quickly run through 1,000 scenarios in near real time, anticipating more moves and creating more contingencies than the typical human driver (especially the ones distracted by their smartphones).

We aren’t there yet.  In-memory databases are just now taking hold in the corporate world.  I wonder how soon this technology will trickle down and become affordable on desktops and mobile devices, and what kind of wondrous capabilities will be enabled.

The Next Big Thing – Hadoop and NoSQL?

Cloudera, Hortonworks, and MapR are new companies that are commercializing Hadoop (an open source data management project).  As of November of last year, investors had poured over $350 million into Hadoop and related NoSQL startups, according to 451 Research.  Do the venture capitalists think Hadoop and other NoSQL approaches are the next big thing?  The answer is yes … or no, depending on your perspective.

If you are looking to store large amounts of event data, as Google, Yahoo, eBay, Facebook, and LinkedIn do, new data storage and management technologies like Hadoop are a necessity, both for the speed at which data can be stored and retrieved and to avoid the costs of licensing traditional database products.  Even if you are not one of these data giants, there may be some great performance enhancements and cost savings associated with moving some of your data store to Hadoop or NoSQL.

To be clear, the interest in Hadoop and NoSQL is for managing “big data” (see The Current Big Thing – Big Data).  They are not a wholesale replacement for the database technology you’ve been using.  Let’s see where they fit.

The table below shows the different characteristics of three major kinds of data stores.

              Key Value   Document    Table
Data Stored   Event       Document    Transaction
Schema        No          No          Yes
Philosophy    MAD         MAD         ACID
Examples      Hadoop      Couchbase   Oracle

Table storage, commonly referred to as relational or columnar, is the most popular today.  It’s been in use for over 30 years and we’ve pretty much worked the kinks out.  There are legions of data architects, administrators, and programmers trained and proficient in managing these database environments.  Most of the enterprise applications that run your business run on table-driven databases and will for the foreseeable future.  One of the significant features of table-driven databases is that they are great at storing and regurgitating transaction data where the data is always in the same format:

First Name   Middle Name   Last Name
Barry        Hussein       Obama
Willard      Mitt          Romney

This makes it very convenient for applications (like your ERP or CRM systems) to quickly access and update the data in the tables.  We would call this kind of data “structured” and say that it has a “schema” (as in, every transaction has a place for first name, middle name, and last name, and the fields are always in the same order).
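The fixed-schema idea can be shown in a few lines. Here is a sketch using SQLite (standing in for the big relational products in the table above): every row must supply the same three columns, in the same order, and the database enforces that structure.

```python
# A fixed schema in miniature: the table defines the columns once,
# and every inserted row must conform to that structure.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first TEXT, middle TEXT, last TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Barry", "Hussein", "Obama"), ("Willard", "Mitt", "Romney")],
)
lasts = [row[0] for row in conn.execute("SELECT last FROM people ORDER BY last")]
```

Because the structure is declared up front, applications can query any column by name without inspecting each record first, which is exactly what makes table storage so convenient for ERP and CRM systems.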

One other important thing to mention here is that the databases you use to run your transaction-based systems must be ACID compliant.  According to Wikipedia, “In computer science, ACID (atomicity, consistency, isolation, durability) is a set of properties that guarantee that database transactions are processed reliably.”

This isn’t the time to go into exactly what ACID is and how it works, but suffice it to say you don’t want a non-ACID database running your core systems that require frequent updates like most of your ERP applications.
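One of the ACID properties, atomicity, is easy to demonstrate: the statements inside a transaction succeed together or not at all. In this sketch (the account table is invented for the example), the second insert violates the primary key, and rolling back undoes the first insert as well.

```python
# Atomicity, the "A" in ACID: a failed transaction leaves no partial
# updates behind. The duplicate primary key aborts the transaction,
# and rollback also undoes the earlier, successful insert.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.commit()
try:
    conn.execute("INSERT INTO accounts VALUES (1, 100.0)")
    conn.execute("INSERT INTO accounts VALUES (1, 200.0)")  # duplicate key -> error
    conn.commit()
except sqlite3.IntegrityError:
    conn.rollback()  # the first insert is undone along with the failed one
count = conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
```

After the rollback the table is empty, which is the guarantee you want when an ERP transaction touches several tables at once.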

In contrast, the key-value and document stores require no schema.  It’s up to the programmer who uses the data to describe and interpret what is in the data that is stored and retrieved.  For example, in a document store, the schema of the data is stored with the data.  One programmer may store the name with the schema as

<first>,<middle>,<last>

Barry,Hussein,Obama

while another may store it as

<last>,<first>,<middle>

Obama,Barry,Hussein.

When you go to retrieve the data, you have to read the schema first to know how the data is stored.  It may seem like extra effort is required when retrieving the data, but this is offset by how quickly and easily data can be pumped into the data store and how much flexibility it gives the programmer.  It also makes it really convenient to add new information to a document or a record; for example, you can tack on maiden name when it comes up by including the schema in the document that says the following data is <maiden name>.  This kind of thing is really handy with event data where you know there was an event but it could have taken a variety of forms.  For example, one website visitor may register for a white paper with just their name, others might include their company information, and another might request a download without entering information.  The ability to “change the schema on the fly” gives us the ability to gather lots of event data without having to anticipate every possibility in the database design.
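A plain Python dictionary is a decent stand-in for a schema-with-the-data document. In this sketch, each record carries its own field names, so records with different field orders, or with new fields tacked on, can live side by side, and the reader interprets each one from the keys it finds (the third record, with its hypothetical maiden name, is mine, not from the post).

```python
# Schema-on-read in miniature: each "document" carries its own field
# names, so field order can vary and new fields can appear at any time.
events = [
    {"first": "Barry", "middle": "Hussein", "last": "Obama"},
    {"last": "Romney", "first": "Willard", "middle": "Mitt"},  # different field order
    {"first": "Jane", "last": "Doe", "maiden": "Smith"},       # new field tacked on
]

def field(doc, name):
    """Read a field by its name; absent fields simply come back as None."""
    return doc.get(name)

maidens = [field(e, "maiden") for e in events]
```

Nothing had to be declared in advance to accept the maiden-name field, which is exactly the flexibility that makes these stores attractive for unpredictable event data.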

In contrast to the ACID approach to data management, many Hadoop and NoSQL users have a philosophy interestingly described as MAD in a paper by Greenplum (now part of EMC).

MAD stands for Magnetic, Agile, and Deep.   Magnetic means going wide to capture all the different characteristics you can regarding the field of analysis including related data from different sources.  Agile means not waiting to structure the data but getting started with the analysis ASAP and letting it evolve as structure is uncovered in the data.  Deep means capturing as much history as possible and not relying on samples.

What you may be starting to see is that there are tradeoffs in the SQL and Hadoop/NoSQL approaches.  We will do some more comparisons in the next post.   Is it possible to have both?  A few venture capitalists seem to be betting on it.