
In-Memory Databases

October 31, 2012
Google Driverless Car

A couple of years ago I wrote an email to the CEO of Cloudera about why I was so excited about Hadoop and the potential for distributed processing (he never wrote back 😦 ).  When I studied computer science we talked a lot about how processors could be coordinated to do work, and I saw MapReduce as an interesting real-world application of distributed processing – although in this case the emphasis was on data distribution and retrieval.  I’m still a believer in what Hadoop and its offspring can do for managing and reporting on large volumes of transaction data.

I am even more excited about the potential for in-memory data storage.  There is a real mismatch between the speed at which today’s CPUs can process data and the speed at which they can access databases that reside on hard disk drives and even the newer flash drives.  What happens when that bottleneck goes away?  What happens when the CPU is free to do its data processing 1,000 times faster than it does today?
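As a rough illustration of the gap, here is a minimal sketch (not a benchmark – the table name and row counts are made up) that holds the same SQLite table on disk and entirely in memory, then times a scan of each:

```python
import os
import sqlite3
import tempfile
import time

def build(conn):
    """Load the same sample table into whichever database we are given."""
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO events (payload) VALUES (?)",
                     [("event-%d" % i,) for i in range(100_000)])
    conn.commit()

def scan(conn):
    """Count every row and report how long the scan took."""
    t0 = time.perf_counter()
    total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    return total, time.perf_counter() - t0

# One copy of the data on disk...
disk = sqlite3.connect(os.path.join(tempfile.mkdtemp(), "events.db"))
build(disk)

# ...and one copy held entirely in memory.
mem = sqlite3.connect(":memory:")
build(mem)

print("disk:", scan(disk))
print("memory:", scan(mem))
```

The exact numbers will vary wildly with hardware and OS caching; the point is only that the same query against the same data can be run without ever touching the disk.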

The folks in analytics talk about reducing batch jobs from hours to minutes.   They talk about increasing the complexity of jobs – with more data points in a time series or with more data from related data sources – without increasing the time required to do analysis.  This could be a real advantage for the corporate world.

I wonder what this means for even more interesting things like robotics and artificial intelligence.  Say, for example, that my automated vehicle has all the data from every trip it’s made from my home to the grocery store, available to match in real time against the conditions it senses (sees) ahead of it.  It “remembers” the blind driveway that my neighbor occasionally backs out of without looking, or the new oil they put down for a street repair last week that is slick in the morning dew.  Like a computerized chess master playing a board that is always changing, it can quickly run through 1,000 scenarios in near real time, anticipating more moves and creating more contingencies than the typical human driver (especially the ones distracted by their smartphones).

We aren’t there yet.  In-memory databases are just now taking hold in the corporate world.  I wonder how soon this technology will trickle down and become affordable on desktops and mobile devices, and what kind of wondrous capabilities it will enable.


The Next Big Thing – SQL And Hadoop And NoSQL and NewSQL?

In the last post we talked a little about the different approaches taken by traditional database vendors, primarily the relational database vendors, and the proponents of new Hadoop and NoSQL data stores.   What does this mean for you?

On the one hand, your IT infrastructure has a mature, robust capability for the care and feeding of table-driven (primarily relational) databases.  Relational databases are NOT going away.  (Oracle alone reported selling $4.49 billion in database and middleware licenses, updates, and support last quarter.)  BUT relational databases may not be a fit for “big data” environments where storage capacity, retrieval speed, and low cost are the primary requirements.  It is hard to compete with the perception of “free” – Hadoop and many of the NoSQL alternatives are based on “free” open source software.  The vendors of these products sell enhanced versions, supported versions, and services related to the installation and support of those data stores.  Because of this, the Hadoop and NoSQL markets are relatively small (estimated at $77 million in 2011 by IDC).

The relational market is evolving to handle larger volumes of data.  Teradata pioneered the concept of replacing the file structure underlying the relational tables with a distributed file structure within a server chassis using specialized hardware.  Companies like EMC with their Greenplum database have figured out how to distribute relational data across lots of commodity servers.  Most of the big relational database companies are finding ways to incorporate Hadoop into their offerings so that an analyst can use existing tools to combine data from the table environment with unstructured data.  They want you to keep buying database licenses.

The most likely scenario for most enterprises is that Hadoop and NoSQL will be used to store and stage data for analysis, similar to the way that today’s data warehouse is used to store and combine data for data marts where analysis takes place at a department level.  It’s the Operational Data Store in the table below.

|  | Operational Data Store | Enterprise Data Warehouse | Data Marts |
| --- | --- | --- | --- |
| Role | Data Quality and Transformation; Business Intelligence and Analytics Where Detailed Historical Data Is Required | Business Intelligence and Analytics for Enterprise Needs | Business Intelligence and Analytics for Departmental and Line-of-Business Needs |
| Technology | Hadoop and NoSQL | Relational/Cubes/Specialized Data Structures | Relational/Cubes/Specialized Data Structures |

Basically we are building a data refinery to handle all the new volumes and varieties of input.  It is an incremental cost to your organization, and it will have to be justified to the business.

We will also see hybrid combinations of the best of both the relational and NoSQL models, sometimes referred to as NewSQL – for example, NuoDB.  The approach NuoDB takes in combining the best of relational and NoSQL is getting endorsements from the likes of Gary Morgenthaler (a co-founder of Ingres and Illustra) and Mitchell Kertzman (the former CEO of Sybase).

In the end, business needs should drive the analysis to be performed, and the needs of that analysis should drive the information technology required to store and process the data.  Don’t get caught up in the hype around “The Next Big Thing.”

The Next Big Thing – Hadoop and NoSQL?

Cloudera, HortonWorks, and MapR are new companies commercializing Hadoop (an open source data management project).  As of November of last year, investors had poured over $350 million into Hadoop and related NoSQL startups, according to 451 Research.  Do the venture capitalists think Hadoop and other NoSQL approaches are the next big thing?  The answer is yes … or no, depending on your perspective.

If you are looking to store large amounts of event data, as Google, Yahoo, eBay, Facebook, and LinkedIn do, new data storage and management technologies like Hadoop are a necessity – both for the speed at which data can be stored and retrieved and to avoid the costs of licensing traditional database products.  Even if you are not one of these data giants, there may be some great performance enhancements and cost savings associated with moving some of your data store to Hadoop or NoSQL.

To be clear, the interest in Hadoop and NoSQL is for managing “big data” (see The Current Big Thing – Big Data).  They are not a wholesale replacement for the database technology you’ve been using.  Let’s see where they fit.

The table below shows the different characteristics of three major kinds of data stores: table (relational/columnar), key-value, and document.

[Table: table, key-value, and document stores compared by the kind of data stored]
Table storage, commonly referred to as relational or columnar, is the most popular today.  It’s been in use for over 30 years and we’ve pretty much worked the kinks out.  There are legions of data architects, administrators, and programmers trained and proficient in managing these database environments.  Most of the enterprise applications that run your business run on table-driven databases and will for the foreseeable future.  One of the significant features of table-driven databases is that they are great at storing and regurgitating transaction data where the data is always in the same format:

| First Name | Middle Name | Last Name |
| --- | --- | --- |
| Barry | Hussein | Obama |
| Willard | Mitt | Romney |

This makes it very convenient for applications (like your ERP or CRM systems) to quickly access and update the data in the tables.  We would call this kind of data “structured” and say that it has a “schema” (as in: every transaction has a place for first name, middle name, and last name, and the fields are always in the same order).
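To make the idea concrete, here is a minimal sketch using SQLite as a stand-in for a table-driven database (the table and column names are illustrative): the schema is declared once, up front, and every row must fit it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared up front: every row has the same three fields,
# always in the same order.
conn.execute("""
    CREATE TABLE person (
        first_name  TEXT,
        middle_name TEXT,
        last_name   TEXT
    )
""")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [("Barry", "Hussein", "Obama"),
     ("Willard", "Mitt", "Romney")],
)

# Because the structure is fixed, applications can rely on it when querying.
rows = conn.execute(
    "SELECT last_name FROM person ORDER BY last_name").fetchall()
print(rows)  # [('Obama',), ('Romney',)]
```

An application never has to ask each row what fields it contains – the schema already answered that question.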

One other important thing to mention here is that the databases you use to run your transaction based systems must be ACID compliant.   According to Wikipedia,  “In computer science, ACID (atomicity, consistency, isolation, durability) is a set of properties that guarantee that database transactions are processed reliably.”

This isn’t the time to go into exactly what ACID is and how it works, but suffice it to say you don’t want a non-ACID database running your core systems that require frequent updates like most of your ERP applications.
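A minimal sketch of just the “A” in ACID – atomicity – again using SQLite (the account names and amounts are made up): when any step of a transaction fails, every step is undone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

# Try to transfer 150 from alice to bob.  The guard below fails, so the
# whole transaction rolls back and neither update survives.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 150 "
                     "WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 150 "
                     "WHERE name = 'bob'")
        (bal,) = conn.execute("SELECT balance FROM accounts "
                              "WHERE name = 'alice'").fetchone()
        if bal < 0:
            raise ValueError("insufficient funds")
except ValueError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0} — the partial transfer never happened
```

Without atomicity, a crash between the two updates could destroy 150 dollars – exactly the kind of risk you cannot take in an ERP system.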

In contrast, the key-value and document stores require no predefined schema.  It’s up to the programmer who uses the data to describe and interpret what is stored and retrieved.  In a document store, for example, the schema is stored with the data, so one programmer may store a name with one set of field labels while another stores it with a different set.  When you go to retrieve the data, you have to read the schema first to know how the data is stored.  It may seem like extra effort is required when retrieving the data, but this is offset by how quickly and easily data can be pumped into the data store and how much flexibility it gives the programmer.  It also makes it really convenient to add new information to a document or a record; for example, you can tack on a maiden name when it comes up by including the schema in the document that says the following data is <maiden name>.  This kind of thing is really handy with event data, where you know there was an event but it could have taken a variety of forms.  For example, one website visitor may register for a white paper with just their name, others might include their company information, and another might request a download without entering information.  The ability to “change the schema on the fly” lets us gather lots of event data without having to anticipate every possibility in the database design.
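A minimal sketch of this schema-on-read idea using plain JSON documents (the field names here are hypothetical, not from any particular document store): each document carries its own schema, and the reader interprets it at retrieval time.

```python
import json

# Two programmers store the "same" kind of record with different,
# self-describing schemas — the field names travel with the data.
doc_a = json.dumps({"first": "Barry", "middle": "Hussein", "last": "Obama"})
doc_b = json.dumps({"given_name": "Willard", "surname": "Romney"})

def full_name(raw):
    """Read the schema at retrieval time and use whichever fields exist."""
    doc = json.loads(raw)
    first = doc.get("first") or doc.get("given_name") or ""
    last = doc.get("last") or doc.get("surname") or ""
    return f"{first} {last}".strip()

print(full_name(doc_a))  # Barry Obama
print(full_name(doc_b))  # Willard Romney

# Adding a new field later needs no schema migration — just include it:
doc_c = json.dumps({"first": "Jane", "last": "Doe", "maiden_name": "Smith"})
print(json.loads(doc_c).get("maiden_name"))  # Smith
```

The flexibility lands on the reader: `full_name` has to know about every labeling convention in use, which is the tradeoff the post describes.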

In contrast to the ACID approach to data management, many Hadoop and NoSQL users have a philosophy interestingly described as MAD in a paper by Greenplum (now part of EMC).

MAD stands for Magnetic, Agile, and Deep.   Magnetic means going wide to capture all the different characteristics you can regarding the field of analysis including related data from different sources.  Agile means not waiting to structure the data but getting started with the analysis ASAP and letting it evolve as structure is uncovered in the data.  Deep means capturing as much history as possible and not relying on samples.

What you may be starting to see is that there are tradeoffs in the SQL and Hadoop/NoSQL approaches.  We will do some more comparisons in the next post.   Is it possible to have both?  A few venture capitalists seem to be betting on it.