The Next Big Thing – Hadoop and NoSQL?

Cloudera, HortonWorks, and MapR are new companies that are commercializing Hadoop (an open source data management project).  As of November of last year investors had poured over $350 million into Hadoop and related NoSQL startups according to 451 Research.  Do the venture capitalists think Hadoop and other NoSQL approaches are the next big thing?  The answer is yes … or no, depending on your perspective.

If you are looking to store large amounts of event data, as Google, Yahoo, eBay, Facebook, and LinkedIn have, new data storage and management technologies like Hadoop are a necessity, both for the speed at which data can be stored and retrieved and to avoid the costs of licensing traditional database products. Even if you are not one of these data giants, there may be some great performance enhancements and cost savings associated with moving some of your data store to Hadoop or NoSQL.

To be clear, the interest in Hadoop and NoSQL is for managing “big data” (see The Current Big Thing – Big Data). They are not a wholesale replacement for the database technology you’ve been using. Let’s see where they fit.

The table below shows the different characteristics of 3 major kinds of data stores.

[Table comparing the three kinds of data stores; only the labels “Key Value” and “Data Stored” survive.]


Table storage, commonly referred to as relational or columnar, is the most popular today. It’s been in use for over 30 years and we’ve pretty much worked the kinks out. There are legions of data architects, administrators, and programmers trained and proficient in managing these database environments. Most of the enterprise applications that run your business run on table-driven databases and will for the foreseeable future. One of the significant features of table-driven databases is that they are great at storing and regurgitating transaction data where the data is always in the same format:

First Name    Middle Name    Last Name
Barry         Hussein        Obama
Willard       Mitt           Romney

This makes it very convenient for applications (like your ERP or CRM systems) to quickly access and update the data in the tables. We would call this kind of data “structured” and say that it has a “schema” (as in, every transaction has a place for first name, middle name, and last name, and the fields are always in the same order).
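To make the idea of a fixed schema concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a table-driven database. The table name `person`, the column names, and the extra two-value row are assumptions made up for illustration; they are not taken from any particular ERP or CRM product.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

# The schema is declared once, up front: every row has the same three fields.
conn.execute(
    "CREATE TABLE person (first_name TEXT, middle_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [("Barry", "Hussein", "Obama"), ("Willard", "Mitt", "Romney")],
)

# Because the layout is fixed, an application can update a column by name
# without inspecting each row first.
conn.execute(
    "UPDATE person SET first_name = ? WHERE last_name = ?", ("Barack", "Obama")
)

# A row that does not fit the schema (two values instead of three) is refused.
try:
    conn.execute("INSERT INTO person VALUES (?, ?)", ("Madonna", "Ciccone"))
    rejected = False
except sqlite3.Error:
    rejected = True

rows = conn.execute(
    "SELECT first_name, middle_name, last_name FROM person ORDER BY last_name"
).fetchall()
print(rows, rejected)
```

The point of the sketch is the trade: the database refuses anything that does not match the declared layout, and in return every application reading the table knows exactly where each field is.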

One other important thing to mention here is that the databases you use to run your transaction-based systems must be ACID compliant. According to Wikipedia, “In computer science, ACID (atomicity, consistency, isolation, durability) is a set of properties that guarantee that database transactions are processed reliably.”

This isn’t the time to go into exactly what ACID is and how it works, but suffice it to say you don’t want a non-ACID database running your core systems that require frequent updates, like most of your ERP applications.
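Without going into the full story, the "A" in ACID can be sketched in a few lines. The example below is only an illustration under assumed names: the `account` table, the `transfer` helper, and the balances are hypothetical, and sqlite3 is used simply because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)", [("checking", 100), ("savings", 0)]
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between two accounts as one all-or-nothing transaction."""
    try:
        # Used as a context manager, the connection commits if the block
        # finishes and rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


first = transfer(conn, "checking", "savings", 60)   # fits within the balance
second = transfer(conn, "checking", "savings", 60)  # would overdraw, undone
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(first, second, balances)
```

The second transfer debits the source account and then fails, yet the debit never becomes visible: either both updates happen or neither does. That is the kind of guarantee a core system with frequent updates depends on.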

In contrast, the key-value and document stores require no schema.  It’s up to the programmer who uses the data to describe and interpret what is in the data that is stored/recovered.   For example, in a document store, the schema of the data is stored with the data.  One programmer may store the name with the schema as



while another may store it as



When you go to retrieve the data you have to read the schema first to know how the data is stored. It may seem like extra effort is required when retrieving the data, but this is offset by how quickly and easily data can be pumped into the data store and how much flexibility it gives the programmer. It also makes it really convenient to add new information to a document or a record. For example, you can tack on a maiden name when it comes up by including the schema in the document that says the following data is <maiden name>. This kind of thing is really handy with event data, where you know there was an event but it could have taken a variety of forms. For example, one website visitor may register for a white paper with just their name, others might include their company information, and another might request a download without entering any information. The ability to “change the schema on the fly” gives us the ability to gather lots of event data without having to anticipate every possibility in the database design.
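Here is a rough sketch of what “schema stored with the data” can look like. It uses a plain Python dict holding JSON text as a stand-in for a document store; the `put`, `get`, and `describe` helpers, the event ids, and the visitor details are all invented for illustration and do not correspond to any real product's API.

```python
import json

# A dict stands in for a document store: document id -> serialized JSON text.
store = {}


def put(doc_id, doc):
    """Write a document; no table definition is consulted or required."""
    store[doc_id] = json.dumps(doc)


def get(doc_id):
    """Read a document back; its field names travel with the data."""
    return json.loads(store[doc_id])


# Three visitors, three different shapes for the same kind of event.
put("evt-1", {"event": "download", "name": "Pat Lee"})
put("evt-2", {"event": "download", "name": "Sam Roy", "company": "Acme"})
put("evt-3", {"event": "download"})

# Changing the schema on the fly: tack a new field onto one document only.
doc = get("evt-1")
doc["maiden_name"] = "Jones"
put("evt-1", doc)


def describe(doc):
    """The reader inspects the field names before interpreting the values."""
    who = doc.get("name", "anonymous")
    extras = sorted(key for key in doc if key not in ("event", "name"))
    return who, extras


summary = [describe(get(doc_id)) for doc_id in ("evt-1", "evt-2", "evt-3")]
print(summary)
```

Writing is trivial because nothing is checked on the way in. The cost shows up in `describe`, where the reading code has to cope with fields that may or may not be present.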

In contrast to the ACID approach to data management, many Hadoop and NoSQL users have a philosophy interestingly described as MAD in a paper by Greenplum (now part of EMC).

MAD stands for Magnetic, Agile, and Deep.   Magnetic means going wide to capture all the different characteristics you can regarding the field of analysis including related data from different sources.  Agile means not waiting to structure the data but getting started with the analysis ASAP and letting it evolve as structure is uncovered in the data.  Deep means capturing as much history as possible and not relying on samples.

What you may be starting to see is that there are tradeoffs in the SQL and Hadoop/NoSQL approaches.  We will do some more comparisons in the next post.   Is it possible to have both?  A few venture capitalists seem to be betting on it.