Archive for August, 2012

Hadoop, The Big File System

I just saw this quote by John Santaferraro, VP of Solutions and Product Marketing at ParAccel Inc.  I couldn’t have said it better –

“It’s always interesting to hear discussion where Hadoop is positioned as the panacea for big data. I much prefer to adopt an approach that acknowledges a file system approach for what it is and what it does well. File systems like Hadoop are good for capturing data, archiving, filtering, transforming, and doing some batch analytics. Where Hadoop falls down is when companies try to write programs to use a file system to do complex analytics, or to do analytics where the data sources and algorithms are constantly changing. In like manner, there are analytic platforms, built on next generation database technology, that have been built from the ground up to execute high performance analytics on massive amounts of data. While new companies will spring up around Hadoop to visualize what is there, it will be extremely difficult, costly, and time consuming (in years) for companies to figure out how to use a file system to do analytics.”

I would spin that a little and say that Hadoop may not be a “panacea” for big data, but it has a featured role as a “relatively” inexpensive store for lots of data you need “relatively” rapid access to.  It is not a data warehouse or data mart for doing predictive modeling.  All the really smart people working in the Hadoop ecosystem will come up with wonderful new ways to manipulate data with Hadoop, but you will still need other database and next-generation database technologies to do the data manipulation required for advanced analytics.  It’s not an either-or proposition.

Categories: Uncategorized

Analytics, Say What?

I was in a company meeting when a sales manager presented a PowerPoint slide like the following and proudly proclaimed, “the sales trend is up:”

[Slide: a trend line drawn through two data points]

Technically, he was right.   A trend line between the two data points was up, and the discussion moved on from the perspective that things in sales were going well.  Since the purpose of the meeting was to raise morale, this “analysis” of the data served its purpose.

However, the sales numbers for 5 periods looked more like the following:

[Slide: the same sales numbers over five periods]

Imagine the direction the discussion might have taken given this perspective!
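
A quick sketch of why the two slides disagree. The numbers below are invented for illustration (the post doesn’t give the actual figures), but they are shaped like the story: the two most recent periods point up, while a fit over all five periods points down.

```python
# Invented sales numbers for five periods (illustrative only).
sales = [100, 80, 60, 40, 90]

# The slide's "trend": a line through only the last two data points.
slide_trend = sales[-1] - sales[-2]   # +50 per period: "sales are up!"

# A least-squares slope over all five periods tells a different story.
def ols_slope(ys):
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

print(slide_trend)       # 50
print(ols_slope(sales))  # -6.0  (sales actually declining overall)
```

Both calculations are “technically right” about their own inputs, which is exactly why the presentation and interpretation of the data needs governing.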

I’m using this simplified example of how data is “analyzed” in the enterprise because when it comes to the new world of analytics, not enough emphasis is placed on managing the presentation and interpretation of the data.

I am a big proponent of Data Governance and the emphasis on the quality, ownership, understanding, and representation of corporate data.   I think we need the same rigor around “Analytics Governance”.   The problem will be how to educate users and enforce standards without stifling curiosity and creativity.  It won’t be easy.

Categories: Uncategorized

I Put That Sh*t On Everything

[Image: Frank’s RedHot sauce]

Well-constructed marketing messages have a life at all levels within and outside the organization.  If you have a well-constructed marketing message, everyone inside and outside your company knows what your products do, or at least what you do that is in their best interests.  For example, Google went to market as a search engine.  From the beginning, everyone who ever heard of Google knew what Google was for.  How many realized it was a channel for advertising?  Imagine if Google had gone to market as a tool for receiving targeted ads.

I think about this a lot because I work for technology companies and sell products and services to large enterprises.  I rely on marketing messages created by others that are embedded in advertisements, marketing collateral, white papers, and corporate presentations.  Sometimes they are right on the money.  Too often, they are techno-speak that a) no one can remember, and b) provides no clear reason why anyone would use the company or product.

I can’t knock product marketing for failing to construct great messages about what products do and why I should sell them / customers should buy them.   It is really hard to do, especially when there are deadlines that must be met for product and release launches.

The best messages I’ve ever seen came from working directly with customers, who have a keen sense of what works for them and what it will take to sell their management on a technology product.  For example, back in the early days of Teradata I worked in a group called Industry Consulting that ran detailed proof-of-concept projects on customer data to demonstrate the benefits of an enterprise data warehouse.  I wasn’t in the retail group, but one of their most popular stories was about a retailer doing market basket analysis who discovered that the most common item purchased with diapers was beer.

The “beer and diapers” story about determining what-goes-with-what took off like wildfire and became part of the repertoire of benefits successfully pitched by every salesperson at the company (in spite of the fact that there is some contention about the truthiness of the discovery).  A customer story that falls into the category of “you can’t make this stuff up” is golden.  The thing is, you really have to work with customers directly to get these stories, and then marketing has to be able to pivot quickly to adopt and promote them.

If you are lucky, the message will be so clear and memorable you will be able to “put that story on everything.”

Categories: Uncategorized

Data Virus

“An Ounce Of Prevention Is Worth A Pound Of Cure” – Ben Franklin

A close friend has recently been struggling with some kind of stomach bug.  That got me thinking about the computer equivalent of a virus and how it gets introduced into our systems.   We usually think of a virus as a program that wreaks havoc on the operations of a system.  I wonder if there is a parallel form of “contamination” in some of the data that our systems digest.

If we are basing analysis on event data from sensors, web interactions, and call detail, (aka big data) what impact is there if there is a rogue (broken yet still functioning) sensor, some form of web server hack, or duplicate entries from a software malfunction in a switch?  How serious does it have to be before we detect anomalies in the data, or before it impacts the results of analysis and modeling?  Could someone introduce bad data intentionally to corrupt or influence the outcome of analysis and modeling by others?

If we are going to base important business and societal decisions on event data, then we should be thinking about the computer virus model and building analogs for the prevention and detection of “bad data” at the source.  In other words, prevent the inclusion of bad data in our collection or sourcing instead of trying to detect it and isolate it from our analysis after the fact.  Think of it as firewalls and virus detection for data.

Can we learn something from event processing?  Event processing is a method of monitoring and analyzing data as it comes into a system.  Instead of waiting for all the data to be collected and then running queries, the analysis is performed on each piece of data as it flows into the system, and the results are combined with all the previous entries so that the analysis builds on itself as the data is generated.  An example in the financial services industry is detecting unusual trading practices by equities traders during market hours.  It is kind of like taking the analysis to the data, instead of taking the data to the analysis.
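
A toy sketch of that idea applied to the rogue-sensor problem: maintain running statistics incrementally (Welford’s method) and flag each reading against the history collected so far, as it arrives. The class name, threshold, and sensor readings are all invented for illustration; a real complex event processing product would be far more capable.

```python
import math

class StreamMonitor:
    """Flags readings that deviate sharply from the running average,
    analyzed as they arrive rather than after batch collection.
    (A toy sketch of the event-processing idea, not a real CEP engine.)"""

    def __init__(self, threshold=3.0, warmup=10):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0            # sum of squared deviations (Welford's method)
        self.threshold = threshold
        self.warmup = warmup     # readings needed before flagging anything

    def observe(self, x):
        is_anomaly = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            is_anomaly = std > 0 and abs(x - self.mean) > self.threshold * std
        if not is_anomaly:       # keep suspect readings out of the history
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
        return is_anomaly

# Invented sensor readings with one obviously rogue value at the end.
monitor = StreamMonitor()
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7, 20.0, 20.3, 95.0]
suspect = [r for r in readings if monitor.observe(r)]
print(suspect)  # [95.0] -- caught at the source, before it reaches analysis
```

The point is the shape of the solution: the check happens in the collection stream, so the bad reading never contaminates the downstream data store.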

Event processing tools are somewhat expensive because they are new and are typically built for processing “complex” events from disparate but related sources to make inferences.  The cost of implementing these tools at this time may exceed the cost of cleaning data or isolating “bad” data after the fact.  However, if there is enough demand, I expect to see some of the intellectual property used for complex event processing applied to detecting and isolating suspect data, either at the source or in the collection stream.  The demand will happen when there are more users of more big data, and when users realize critical business decisions may be suspect because of issues with bad data.

However, technology alone won’t stop humans from corrupting the sources of data or influencing outcomes by introducing “bad” data.  We will still need robust data governance and security models because event processing won’t capture every data imperfection.  If we effectively leverage both technology like event processing and processes like data governance, we can tip the scales in favor of prevention and, in the end, make well-informed decisions and predictions that can be defended and reproduced.

Categories: Uncategorized

The Data Refinery

August 8, 2012

[Refinery photo: Vermin Inc / Free Photos]

Analysts often talk about the process of turning corporate data into information as an information factory.   The factory analogy is a good one in an environment where the raw materials for the data warehouse are sourced from applications that have already refined and packaged the data into transactions or tables.  Much of the processing of the data is “assembly” on a production line – joining data so the whole is more useful than the sum of the parts.

The introduction of lots of event data into the mix on the front end, as we see in today’s “big data”, makes me more inclined to think of it as a data refinery.  This data is more like raw crude oil, which has to be refined into other substances like gasoline and kerosene before it is useful.

This thinking was reinforced by a post entitled “The Data Stack: A Structured Approach” by Gil Elbaz, where he describes his perspective on the technology layers required to make data useful.  He’s building the mother of all data sources (though that description doesn’t really do his business justice).

For the enterprise, I think the stack today looks something like the following:

Raw Material – Transaction data from enterprise applications and event data from other systems

Sourcing – Determining where to get the data, when, and at what point in the lifecycle of the data at its source.  Do you want “raw” event data from the source, a transaction log that sequences and formats the data, or will you rely on an application or service provider who manipulates the data before you get it?

Connection – The method and timing for acquiring the data.  Will you rely on file transfers or web application programming interfaces?  Is the data needed in “real time”, or on a periodic basis?  What is the extraction method from the data source?

Quality/Provenance – When and how to determine the quality of the data, and how to retain source information about the data so that subsequent analysts will have some measure of its reliability.

Collection – Where to put the data.  This is where most of our thinking tends to go – what database will we use?  This is where there can be a real divergence between what we’ve traditionally used as an operational data store for event data and the data warehouse for transaction data.

Combination – The information that is common between data sets that will be used to join them.  Determining or deriving keys to make this possible can be really hard for event data that doesn’t come from traditional enterprise applications.  Event data, for example, may not contain explicit identifiers like <customer number> that can be used to join it to your CRM or Marketing databases.

Derivation – How new data can be derived from existing data.   For example, a point of sale event gets a new segmentation code appended to it based on the contents of the entire market basket purchased by the customer.

Access Methods – The methods that will be used to get access to the data.  Will it be made available to other programs and systems, or does it need to be made available in a human-readable format?

Presentation – The reporting and analytic methods that are used to understand the data.

Stewardship – The method for identifying who is responsible for the quality and availability of the data all along the refining process.   Who do you call when you want to know what the data in a field represents and how it got there?

Security – The protection of data as a corporate asset, and the protection of personal information as required by law.
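
A toy walk through a few of the layers above: quality/provenance, combination (deriving a join key where there is no explicit customer number), and derivation (appending a segment code based on the purchase).  Every field name, the CRM lookup table, and the premium threshold are invented for illustration; a real refinery would use proper ETL tooling.

```python
from datetime import datetime, timezone

# Hypothetical CRM lookup: point-of-sale device id -> customer record.
crm = {"a1b2": {"customer_id": 42}}

def refine(raw_event):
    # Quality/Provenance: reject obviously broken readings and tag
    # each event with its source and load time.
    if raw_event.get("amount", -1) < 0:
        return None
    event = dict(raw_event,
                 source="pos_terminal",
                 loaded_at=datetime.now(timezone.utc).isoformat())

    # Combination: the event carries no <customer number>, so derive a
    # join key from the device id and look it up against CRM data.
    match = crm.get(event["device_id"])
    event["customer_id"] = match["customer_id"] if match else None

    # Derivation: append a segment code based on the basket value.
    event["segment"] = "premium" if event["amount"] > 100 else "standard"
    return event

print(refine({"device_id": "a1b2", "amount": 250.0}))
```

Each stage maps to a layer in the stack, which is the real point: the raw event gets progressively more useful as it moves through the refinery.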

I’m sure I’ve left some things out, but I will have to think about it some more.

Categories: Uncategorized

R Wins?

Robert A. Muenchen at the University of Tennessee wrote a blog entry entitled “Will 2015 be the Beginning of the End for SAS and SPSS?”  In it, he projects that R will overtake SAS and SPSS as the tool of choice for analysts in 2015, based on its popularity with professors and college students.  It is a rational argument.

Based on my experience working around Silicon Valley, I would posit that R is already the tool of choice for startups, which are often staffed by recent college graduates and their professors.

However, I also have one foot in the enterprise IT world where SAS and SPSS are well entrenched.  There is so much corporate investment in these products that it will be costly to replace them.  That isn’t to say that the IT buyer won’t be considering alternatives when their annual renewals hit the budget.

But – there will be other considerations besides cost.  SAS and SPSS are both surrounded by software environments for the entire data lifecycle.  SAS, for example, is attempting to expand its footprint in the enterprise with new visualization software, in-memory and grid computing products that act as data marts, and new software for the extraction, transformation, and loading of data.  Like IBM, SAS wants to offer an entire analysis ecosystem, making it more attractive to the enterprise IT shop.  The open source ecosystem for R is not that mature.  If it were easy for open source to penetrate the enterprise, wouldn’t MySQL (the open source relational database) be the corporate standard?

However, a large company like Oracle could become the point of integration for all things R, with attractive price points – similar to what they are doing with MySQL.  That would give R both the tailwind from the academic community and corporate respectability.

In the meantime, it will likely be a case of using the best tool for the job/budget.  To quote a commenter (sorry I don’t remember where I saw your comment):

“I learned R to graduate, SPSS to get a job, and SAS to make a living.”

Categories: Uncategorized Tags: , , ,