Data Virus

“An Ounce Of Prevention Is Worth A Pound Of Cure” – Ben Franklin

A close friend has recently been struggling with some kind of stomach bug.  That got me thinking about the computer equivalent of a virus and how it gets introduced into our systems.   We usually think of a virus as a program that wreaks havoc on the operations of a system.  I wonder if there is a parallel form of “contamination” in some of the data that our systems digest.

If we are basing analysis on event data from sensors, web interactions, and call detail records (aka big data), what is the impact of a rogue (broken yet still functioning) sensor, some form of web server hack, or duplicate entries from a software malfunction in a switch? How serious does the problem have to be before we detect anomalies in the data, or before it affects the results of analysis and modeling? Could someone introduce bad data intentionally to corrupt or influence the outcome of analysis and modeling by others?

If we are going to base important business and societal decisions on event data, then we should be thinking about the computer virus model and building analogs for the prevention and detection of “bad data” at the source. In other words, we should keep bad data out of our collection or sourcing in the first place, instead of trying to detect it and isolate it from our analysis after the fact. Think of it as the data equivalent of firewalls and virus detection.
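To make the "firewall for data" idea a little more concrete, here is a minimal sketch in Python of a screening step that sits in front of the collection store. Everything in it is an assumption for illustration: the `Event` fields, the `screen` function, and the idea that each source stamps its events with a sequence number are hypothetical, not a description of any particular product. It quarantines exact repeats (the malfunctioning switch case) and readings outside a plausible range (the rogue sensor case).

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; the field names are illustrative only."""
    source_id: str   # which sensor, server, or switch produced the event
    sequence: int    # assumed per-source sequence number
    value: float     # the reading itself


def screen(
    events: Iterable[Event], low: float, high: float
) -> Tuple[List[Event], List[Tuple[Event, str]]]:
    """Split events into accepted ones and quarantined (event, reason) pairs."""
    seen: Set[Tuple[str, int]] = set()
    accepted: List[Event] = []
    quarantined: List[Tuple[Event, str]] = []
    for event in events:
        key = (event.source_id, event.sequence)
        if key in seen:
            # Same source and sequence number seen before: treat as a duplicate.
            quarantined.append((event, "duplicate"))
            continue
        seen.add(key)
        if not (low <= event.value <= high):
            # Reading falls outside the range we consider physically plausible.
            quarantined.append((event, "out_of_range"))
            continue
        accepted.append(event)
    return accepted, quarantined


if __name__ == "__main__":
    incoming = [
        Event("sensor-1", 1, 20.0),
        Event("sensor-1", 1, 20.0),   # repeated entry
        Event("sensor-2", 1, 999.0),  # implausible reading
        Event("sensor-1", 2, 21.5),
    ]
    good, bad = screen(incoming, low=0.0, high=100.0)
    print(len(good), "accepted;", [reason for _, reason in bad])
```

Suspect events are quarantined rather than silently dropped so that someone can review them later. A gate like this only catches the obvious cases; a sensor that is wrong but still within range would pass straight through.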

Can we learn something from event processing? Event processing is a method of monitoring and analyzing data as it comes into a system. Instead of waiting for all the data to be collected and then running queries, the analysis is performed on each piece of data as it flows in, and the result is combined with the results from all previous entries, so the analysis builds on itself as the data is generated. It is kind of like taking the analysis to the data, instead of taking the data to the analysis. An example in the financial services industry is detecting unusual trading practices by equities traders during market hours.
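As a rough illustration of analysis that "builds on itself," the Python sketch below keeps a running mean and standard deviation using Welford's online algorithm and flags any new value that sits far from what has been seen so far. The class and function names, the threshold of three standard deviations, and the warm-up count are all my own assumptions; real event processing tools do far more than this.

```python
import math
from typing import Iterable, List


class RunningStats:
    """Running mean and sample standard deviation (Welford's online algorithm)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new value into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def stddev(self) -> float:
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


def flag_outliers(
    values: Iterable[float], threshold: float = 3.0, warmup: int = 5
) -> List[float]:
    """Return values lying more than `threshold` deviations from the running mean.

    `threshold` and `warmup` are illustrative defaults, not recommendations.
    """
    stats = RunningStats()
    flagged: List[float] = []
    for x in values:
        if stats.count >= warmup and stats.stddev > 0:
            z = abs(x - stats.mean) / stats.stddev
            if z > threshold:
                flagged.append(x)
                continue  # keep the suspect value out of the baseline
        stats.update(x)
    return flagged


if __name__ == "__main__":
    readings = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 50.0, 10.1]
    print(flag_outliers(readings))
```

Each value is examined once, as it arrives, and nothing has to be stored beyond three numbers. Keeping flagged values out of the baseline is a design choice with a cost: a sensor that drifts slowly, rather than jumping, would shift the baseline along with it and might never be flagged.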

Event processing tools are somewhat expensive because they are new and because they are typically built to process “complex” events from disparate but related sources in order to make inferences. At this time, the cost of implementing these tools may exceed the cost of cleaning data or isolating “bad” data after the fact. However, if there is enough demand, I expect to see some of the intellectual property behind complex event processing applied to detecting and isolating suspect data, either at the source or in the collection stream. That demand will come when there are more users of more big data, and when those users realize that critical business decisions may be suspect because of bad data.

However, technology alone won’t stop humans from corrupting the sources of data or influencing outcomes by introducing “bad” data. We will still need robust data governance and security models, because event processing won’t capture every data imperfection. If we effectively leverage both technology like event processing and processes like data governance, we can tip the scales in favor of prevention and, in the end, make well-informed decisions and predictions that can be defended and reproduced.
