
The BI Cliff

November 28, 2012

I was traveling the last two weeks, spending time with customers.  Many of them are thinking about where they will make their IT investments for next year.

A common theme is the need for self-service BI – pushing reporting and light analytics out of IT and into the hands of the business users.   I know what you are thinking – we’ve been doing this for years (though some might argue, not so successfully).   The difference now is that many IT shops have no choice.  They are facing a BI Cliff that will hurt their credibility and create friction with their business users.

Business users are clamoring for “big data” analysis and “real-time” analytics.  We used to queue up their report requests and set delivery expectations for weeks (if not months).  Now they expect to have all data immediately for visualization and modeling.  And all this has to be done without increasing the IT resources required to cater to their needs, such as:

  • Request/Workflow management
  • Report Writers
  • Data Stewards
  • Programmers
  • Dashboard Developers
  • Business Analysts
  • Data Administrators
  • ETL Programmers


In my consulting business, we work hard with the business users to help them to clearly identify, qualify, and prioritize their BI and analytic needs.  We put the onus on them to justify the business need and the investment required from their organization to do the reporting and analysis they think they need.   We help them to understand that they are responsible for data governance – that they own the data and must master what the data represents and how it can/cannot be used.    Only then do we begin the IT planning for new tools and capabilities to improve IT productivity or enable self-service BI.

At the end of the day, IT and Business are attached at the hip.  If IT gets pushed over the BI cliff, they will take the Business users with them.   And no one wants that.


Data Virus


“An Ounce Of Prevention Is Worth A Pound Of Cure” – Ben Franklin

A close friend has recently been struggling with some kind of stomach bug.  That got me thinking about the computer equivalent of a virus and how it gets introduced into our systems.   We usually think of a virus as a program that wreaks havoc on the operations of a system.  I wonder if there is a parallel form of “contamination” in some of the data that our systems digest.

If we are basing analysis on event data from sensors, web interactions, and call detail records (aka big data), what is the impact of a rogue (broken yet still functioning) sensor, some form of web server hack, or duplicate entries from a software malfunction in a switch?  How serious does it have to be before we detect anomalies in the data, or before it impacts the results of analysis and modeling?  Could someone intentionally introduce bad data to corrupt or influence the outcome of analysis and modeling by others?

If we are going to base important business and societal decisions on event data, then we should be thinking about the computer virus model and building analogs for the prevention and detection of “bad data” at the source.  In other words, preventing the inclusion of bad data in our collection or sourcing instead of trying to detect it and isolate it from our analysis after the fact.  Think of it as firewalls and virus detection for data.
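
A minimal sketch of what such a “data firewall” might look like at the point of collection.  The rule names, field names, and thresholds here are all hypothetical, chosen to mirror the duplicate-entry and rogue-sensor examples above:

```python
# Hypothetical "data firewall": screen each record at ingest, before it is
# stored, rather than cleaning the data set after the fact.

def valid_reading(record, seen_ids, low=-40.0, high=60.0):
    """Reject records that fail basic source-side checks."""
    # Rule 1: reject duplicates (e.g., a switch emitting the same event twice)
    if record["id"] in seen_ids:
        return False
    # Rule 2: reject out-of-range values (e.g., a broken-but-running sensor)
    if not (low <= record["value"] <= high):
        return False
    seen_ids.add(record["id"])
    return True

seen = set()
stream = [
    {"id": 1, "value": 21.5},
    {"id": 1, "value": 21.5},   # duplicate entry from a malfunctioning switch
    {"id": 2, "value": 999.0},  # rogue sensor, broken yet still reporting
    {"id": 3, "value": 19.8},
]
clean = [r for r in stream if valid_reading(r, seen)]
print(len(clean))  # 2 records survive the screen
```

The point is not these particular rules but the placement: the checks run in the collection path, so bad data never reaches the analysis at all.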

Can we learn something from event processing?  Event processing is a method of monitoring and analyzing data as it comes into a system.  Instead of waiting for all the data to be collected and then running queries, the analysis is performed on each piece of data as it flows in, and the results are combined with all the previous entries, so the analysis builds on itself as the data is generated.  An example in the financial services industry is detecting unusual trading practices by equities traders during market hours.  It is kind of like taking the analysis to the data, instead of taking the data to the analysis.
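
That incremental style can be sketched in a few lines.  This is only an illustration, assuming a single numeric stream and a made-up four-standard-deviation threshold; it maintains running statistics (Welford’s online algorithm) and flags each new value against everything seen so far:

```python
# Event-style incremental analysis: update running statistics as each value
# arrives, and flag values that deviate sharply from the history to date.
import math

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stdev(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

stats = RunningStats()
flagged = []
for x in [10.1, 9.8, 10.3, 10.0, 55.0, 9.9]:
    if stats.n > 3 and stats.stdev() > 0 and abs(x - stats.mean) > 4 * stats.stdev():
        flagged.append(x)   # anomalous relative to history; quarantine it
    else:
        stats.update(x)     # looks normal; fold it into the running model
print(flagged)  # the 55.0 reading is caught as it arrives
```

No query over the full data set ever runs; each value is judged the moment it appears, which is exactly the property that makes this approach attractive for screening bad data in the collection stream.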

Event processing tools are somewhat expensive because they are new and are typically built for processing “complex” events from disparate but related sources to make inferences.  The cost of implementing these tools today may exceed the cost of cleaning data or isolating “bad” data after the fact.  However, if there is enough demand, I expect to see some of the intellectual property used for complex event processing applied to detecting and isolating suspect data, either at the source or in the collection stream.  That demand will come when more users are working with more big data, and when users realize critical business decisions may be suspect because of issues with bad data.

However, technology alone won’t stop humans from corrupting the sources of data or influencing outcomes by introducing “bad” data.  We will still need robust data governance and security models, because event processing won’t capture every data imperfection.  If we effectively leverage both technology like event processing and processes like data governance, we can tip the scales in favor of prevention and, in the end, make well-informed decisions and predictions that can be defended and reproduced.

The Current Big Thing – Big Data

By now you are probably sick of hearing about Big Data.  I know I am.   It’s like a pop song you can’t get out of your head because you hear it everywhere you go.

According to Wikipedia, “big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analysis, and visualization. “

The fact is we can generate so much information so fast from web sites, social media, automated sensors, communications networks, and other computing related devices that it is becoming increasingly difficult to capture and store the data, let alone analyze it.

The problem with the term “big data” is that the word “big” is ambiguous, and certainly relative to your unique situation.  It reminds me of the argument over what a recession is.  Most people know it when they see it.  They can certainly find lots of evidence of a recession – slow sales, slow economic growth, high unemployment (although, to be fair, slow and high are ambiguous too).  Economists have a quantitative definition for a recession: two consecutive quarters of negative economic growth as measured by a country’s gross domestic product.

Most IT practitioners could probably describe some of the evidence of a big data problem like frequent meetings about how to archive data to free up disk space, complaints about insufficient historical data to do analysis and modeling, or the simple fact that data is coming in with no place to store it.  Would it be possible to have a quantitative measure to define big data – something like an increase in data inflows and storage needs of more than 10% in each of 2 consecutive quarters?
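
Half in jest, that rule is simple enough to write down.  The quarterly terabyte figures below are invented purely for illustration:

```python
# Tongue-in-cheek check of the proposed rule: you have a "big data" problem
# if data volume grows more than 10% in each of two consecutive quarters.

def has_big_data_problem(quarterly_tb, threshold=0.10):
    # Quarter-over-quarter growth rates
    growth = [(b - a) / a for a, b in zip(quarterly_tb, quarterly_tb[1:])]
    # True if any two consecutive quarters each exceed the threshold
    return any(g1 > threshold and g2 > threshold
               for g1, g2 in zip(growth, growth[1:]))

print(has_big_data_problem([100, 103, 107, 112]))  # modest growth: False
print(has_big_data_problem([100, 115, 135, 160]))  # sustained >10%: True
```

Like the GDP definition of a recession, the value of a rule like this is less in the exact threshold than in forcing the conversation to be quantitative.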

OK, maybe not, but I would propose that when someone starts talking “big data,” we get them to be more explicit about what they mean for the business at hand.  How about we quantify the problem, or, better yet, spend more time focused on exactly what “Big Opportunities” would justify all the activity around solving a perceived “Big Data” problem?  Here’s the thing – many organizations haven’t been able to capitalize on their data warehouse and business intelligence investments.  Just going down the path of the next big thing – like big data – won’t benefit them until they have the plans, resources, and commitment to capitalize on a big data solution.

Finally, for companies that do have a big data opportunity, there will be a host of new considerations around the way they manage metadata (descriptions of what the data represents), data governance (rules about how the data is used), data quality, data retention, etc.  These will have a profound effect on the type of analysis that can be performed and on the reliability of the results.  My intent is to cover some of these in future posts.