Archive

Posts Tagged ‘Big Data’

Data Virus


“An Ounce Of Prevention Is Worth A Pound Of Cure” – Ben Franklin

A close friend has recently been struggling with some kind of stomach bug.  That got me thinking about the computer equivalent of a virus and how it gets introduced into our systems.   We usually think of a virus as a program that wreaks havoc on the operations of a system.  I wonder if there is a parallel form of “contamination” in some of the data that our systems digest.

If we are basing analysis on event data from sensors, web interactions, and call detail records (aka big data), what impact is there if there is a rogue (broken yet still functioning) sensor, some form of web server hack, or duplicate entries from a software malfunction in a switch?  How serious does the problem have to be before we detect anomalies in the data, or before it impacts the results of analysis and modeling?  Could someone introduce bad data intentionally to corrupt or influence the outcome of analysis and modeling by others?

If we are going to base important business and societal decisions on event data, then we should be thinking about the computer virus model and building analogs for the prevention and detection of “bad data” at the source.  In other words, prevent the inclusion of bad data in our collection or sourcing instead of trying to detect it and isolate it from our analysis after the fact.  Think of it as firewalls and virus detection for data.

Can we learn something from event processing?  Event processing is a method of monitoring and analyzing data as it comes into a system.  In other words, instead of waiting for all the data to be collected and then running queries, the analysis is performed on each piece of data as it flows into the system, and the results are combined with all the previous entries so that the analysis builds on itself as the data is being generated.  An example in the financial services industry is detecting unusual trading practices by equities traders during market hours.  It is kind of like taking the analysis to the data, instead of taking the data to the analysis.
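To make that concrete, here is a minimal sketch of the idea in Python: each sensor reading is analyzed the moment it arrives, and a running model of each sensor's history keeps building on everything seen so far, so a rogue reading can be flagged in the stream instead of cleaned up after the fact. The sensor name, threshold, and synthetic readings are illustrative assumptions, not any particular event processing product's API.

```python
import random

class RunningStats:
    """Incrementally maintained mean and variance (Welford's algorithm)."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stdev(self):
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0

def process_event(stats, sensor_id, value, warmup=20, threshold=3.0):
    """Analyze one event as it arrives; flag readings far outside this sensor's history."""
    s = stats.setdefault(sensor_id, RunningStats())
    suspect = s.n >= warmup and s.stdev() > 0 and abs(value - s.mean) > threshold * s.stdev()
    s.update(value)  # the running model keeps building on every event seen so far
    return suspect

stats = {}
events = [("temp-01", random.gauss(21.0, 0.5)) for _ in range(200)]
events.append(("temp-01", 85.0))  # a rogue (broken but still reporting) sensor

for sensor_id, value in events:  # events are processed one at a time, as they arrive
    if process_event(stats, sensor_id, value):
        print(f"suspect reading from {sensor_id}: {value:.1f}")
```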

Event processing tools are somewhat expensive because they are new and because they are typically built for processing “complex” events from disparate but related sources to make inferences.  The cost of implementing these tools today may exceed the cost of cleaning data or isolating “bad” data after the fact.  However, if there is enough demand, I expect to see some of the intellectual property behind complex event processing applied to detecting and isolating suspect data, either at the source or in the collection stream.  That demand will come when there are more users of big data, and when those users realize critical business decisions may be suspect because of issues with bad data.

However, technology alone won’t stop humans from corrupting the sources of data or influencing outcomes by introducing “bad” data.  We will still need robust data governance and security models because event processing won’t capture every data imperfection.  If we effectively leverage both technology like event processing and processes like data governance, we can tip the scales in favor of prevention and, in the end, make well-informed decisions and predictions that can be defended and reproduced.


The Next Big Thing – Hadoop and NoSQL?

Cloudera, Hortonworks, and MapR are new companies that are commercializing Hadoop (an open source data management project).  As of November of last year, investors had poured over $350 million into Hadoop and related NoSQL startups, according to 451 Research.  Do the venture capitalists think Hadoop and other NoSQL approaches are the next big thing?  The answer is yes … or no, depending on your perspective.

If you are looking to store large amounts of event data, as Google, Yahoo, eBay, Facebook, and LinkedIn do, new data storage and management technologies like Hadoop are a necessity, both for the speed at which data can be stored and retrieved and to avoid the costs of licensing traditional database products.  Even if you are not one of these data giants, there may be great performance enhancements and cost savings in moving some of your data store to Hadoop or NoSQL.

To be clear, the interest in Hadoop and NoSQL is for managing “big data” (see The Current Big Thing – Big Data).  They are not a wholesale replacement for the database technology you’ve been using.  Let’s see where they fit.

The table below shows the different characteristics of 3 major kinds of data stores.

|             | Key Value | Document  | Table       |
|-------------|-----------|-----------|-------------|
| Data Stored | Event     | Document  | Transaction |
| Schema      | No        | No        | Yes         |
| Philosophy  | MAD       | MAD       | ACID        |
| Examples    | Hadoop    | Couchbase | Oracle      |

Table storage, commonly referred to as relational or columnar, is the most popular today.  It’s been in use for over 30 years and we’ve pretty much worked the kinks out.  There are legions of data architects, administrators, and programmers trained and proficient in managing these database environments.  Most of the enterprise applications that run your business run on table-driven databases and will for the foreseeable future.  One of the significant features of table-driven databases is that they are great at storing and regurgitating transaction data where the data is always in the same format:

| First Name | Middle Name | Last Name |
|------------|-------------|-----------|
| Barry      | Hussein     | Obama     |
| Willard    | Mitt        | Romney    |

This makes it very convenient for applications (like your ERP or CRM systems) to quickly access and update the data in the tables.  We would call this kind of data “structured” and say that it has a “schema” (as in every transaction has a place for first name, middle name, last name and the fields are always in the same order.)

One other important thing to mention here is that the databases you use to run your transaction based systems must be ACID compliant.   According to Wikipedia,  “In computer science, ACID (atomicity, consistency, isolation, durability) is a set of properties that guarantee that database transactions are processed reliably.”

This isn’t the time to go into exactly what ACID is and how it works, but suffice it to say you don’t want a non-ACID database running your core systems that require frequent updates like most of your ERP applications.
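As a small illustration of why atomicity matters for those core systems, here is a sketch using sqlite3, the ACID-compliant embedded database that ships with Python. The accounts table and amounts are made up for the example; the point is that the two updates either commit together or not at all.

```python
import sqlite3

# A minimal illustration of atomicity: both updates commit together, or neither does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("checking", 100), ("savings", 0)])
conn.commit()

try:
    with conn:  # one transaction; an exception inside the block rolls it back
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'checking'")
        cur = conn.execute("SELECT balance FROM accounts WHERE name = 'checking'")
        if cur.fetchone()[0] < 0:
            raise ValueError("insufficient funds")  # triggers the rollback
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'savings'")
except ValueError:
    pass

# Both balances are unchanged; the partial update never became visible.
print(conn.execute("SELECT name, balance FROM accounts").fetchall())
```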

In contrast, the key-value and document stores require no predefined schema.  It’s up to the programmer who uses the data to describe and interpret what is in the data that is stored and retrieved.  For example, in a document store, the schema of the data is stored with the data.  One programmer may store the name with the schema as

<first>,<middle>,<last>

Barry,Hussein,Obama

while another may store it as

<last>,<first>,<middle>

Obama,Barry,Hussein

When you go to retrieve the data you have to read the schema first to know how the data is stored.  It may seem like extra effort is required when retrieving the data, but this is offset by how quickly and easily data can be pumped into the data store and how much flexibility it gives the programmer.  It also makes it really convenient to add new information to a document or a record; for example, you can tack on a maiden name when it comes up by including the schema in the document that says the following data is <maiden name>.  This kind of thing is really handy with event data, where you know there was an event but it could have taken a variety of forms.  For example, one website visitor may register for a white paper with just their name, another might include their company information, and another might request a download without entering any information at all.  The ability to “change the schema on the fly” lets us gather lots of event data without having to anticipate every possibility in the database design.
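Here is a minimal sketch of the “schema travels with the data” idea, using plain Python dictionaries serialized as JSON to stand in for a document store (this is not Couchbase’s or Hadoop’s actual API, and the field names and events are illustrative).

```python
import json

# Each document carries its own field names, so records with different shapes
# can live in the same store. The store here is just an in-memory list.
store = []

def save(document):
    store.append(json.dumps(document))  # the document describes itself

# Three website events captured with whatever fields happened to be available.
save({"first": "Barry", "middle": "Hussein", "last": "Obama", "asset": "white paper"})
save({"last": "Romney", "first": "Willard", "company": "Bain", "asset": "white paper"})
save({"asset": "download"})  # no contact info entered at all

# Schema-on-read: inspect each document's own keys when retrieving it.
for raw in store:
    doc = json.loads(raw)
    name = " ".join(doc[k] for k in ("first", "middle", "last") if k in doc)
    print(name or "(anonymous)", "->", doc["asset"])
```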

In contrast to the ACID approach to data management, many Hadoop and NoSQL users have a philosophy interestingly described as MAD in a paper by Greenplum (now part of EMC).

MAD stands for Magnetic, Agile, and Deep.   Magnetic means going wide to capture all the different characteristics you can regarding the field of analysis including related data from different sources.  Agile means not waiting to structure the data but getting started with the analysis ASAP and letting it evolve as structure is uncovered in the data.  Deep means capturing as much history as possible and not relying on samples.

What you may be starting to see is that there are tradeoffs in the SQL and Hadoop/NoSQL approaches.  We will do some more comparisons in the next post.   Is it possible to have both?  A few venture capitalists seem to be betting on it.

The Current Big Thing – Big Data

By now you are probably sick of hearing about Big Data.  I know I am.   It’s like a pop song you can’t get out of your head because you hear it everywhere you go.

According to Wikipedia, “big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analysis, and visualization.”

The fact is we can generate so much information so fast from web sites, social media, automated sensors, communications networks, and other computing related devices that it is becoming increasingly difficult to capture and store the data, let alone analyze it.

The problem with the term “big data” is that the word “big” is ambiguous, and certainly relative to your unique situation.  It reminds me of the argument over what a recession is.  Most people know one when they see it.  They can certainly find lots of evidence of a recession – slow sales, slow economic growth, high unemployment (although to be fair, slow and high are ambiguous too).  Economists, however, have a quantitative definition: two consecutive quarters of negative economic growth as measured by a country’s gross domestic product.

Most IT practitioners could probably describe some of the evidence of a big data problem: frequent meetings about how to archive data to free up disk space, complaints about insufficient historical data for analysis and modeling, or the simple fact that data is coming in with no place to store it.  Would it be possible to have a quantitative measure to define big data – something like an increase in data inflows and storage needs of more than 10% in each of two consecutive quarters?
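For illustration only, here is a toy version of that yardstick in Python, applied to made-up quarterly storage figures; the 10% threshold is the one proposed above.

```python
# A toy check of the proposed measure: storage needs growing more than 10%
# in each of two consecutive quarters. The quarterly figures are made up.
quarterly_tb = [40, 43, 48, 54, 61]  # terabytes stored at the end of each quarter

growth = [(b - a) / a for a, b in zip(quarterly_tb, quarterly_tb[1:])]
big_data_signal = any(g1 > 0.10 and g2 > 0.10 for g1, g2 in zip(growth, growth[1:]))

print([f"{g:.0%}" for g in growth])            # ['8%', '12%', '12%', '13%']
print("big data by this yardstick?", big_data_signal)
```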

OK, maybe not, but I would propose that when someone starts talking “big data” we get them to be explicit about what they mean as it pertains to the business at hand.  How about we quantify the problem or, better yet, spend more time on exactly what “big opportunities” justify all the activity around solving a perceived “big data” problem?  Here’s the thing – many organizations haven’t been able to capitalize on their data warehouse and business intelligence investments.  Just going down the path of the next big thing – like big data – won’t benefit them until they have the plans, resources, and commitment to capitalize on a big data solution.

Finally, for companies that do have a big data opportunity, there will be a host of new considerations around the way they manage metadata (descriptions of what the data represents), data governance (rules about how the data is used), data quality, data retention, and so on, all of which will have a profound effect on the type of analysis that can be performed and the reliability of the results.  My intent is to cover some of these in future posts.