Practical Master Data Management

December 14, 2012

We spend a lot of time talking with customers about master data management (MDM).  Here are a few bullets to convey the meat of the conversation.

1. Why MDM?

You didn’t need MDM when you had one application and one database.  The nature of things like customer, product, and supplier was understood by everyone who used the application and the data.  As soon as you built or acquired another application with its own database that referred to the same things, you introduced the potential for a mismatch in the attributes used to describe those things.

For example, the billing system has customer address as a Post Office Box and the shipping system has the physical customer address as the shipping address.  What address should a new marketing campaign use?  If I want to do market segmentation based on zip code, and the zip codes are different, which one should I use?

This example is overly simplified.   The problem is usually more along the lines of the customer having 10 source systems with 10 things each having 50 attributes and no automated way of applying the rules to make the decisions about the “right” attributes to use for 100 million occurrences of those “things.”
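
One way to picture "applying the rules" is a per-attribute priority table that says which source system wins. The sketch below is only an illustration under assumed names: the rule table, the source names (`billing`, `shipping`), and the sample values are all hypothetical, and real rules are usually far richer than a fixed priority order.

```python
from typing import Dict, Optional

# Hypothetical rule table: for each attribute, source systems in priority order.
SURVIVORSHIP_RULES = {
    "mailing_address": ["billing", "shipping"],
    "physical_address": ["shipping", "billing"],
    "zip_code": ["shipping", "billing"],
}

Record = Dict[str, Optional[str]]


def pick_attribute(attribute: str, records: Dict[str, Record]) -> Optional[str]:
    """Return the value from the highest-priority source that has a non-empty value."""
    for source in SURVIVORSHIP_RULES[attribute]:
        value = records.get(source, {}).get(attribute)
        if value:
            return value
    return None


def build_master(records: Dict[str, Record]) -> Record:
    """Apply every rule to one entity's source records."""
    return {attr: pick_attribute(attr, records) for attr in SURVIVORSHIP_RULES}


# Made-up records for a single customer as seen by two systems.
records = {
    "billing": {"mailing_address": "PO Box 12", "zip_code": "30301"},
    "shipping": {"physical_address": "500 Main St", "zip_code": "30309"},
}
master = build_master(records)
```

With this table, a marketing campaign segmenting on zip code would use the shipping system's value, because that is what the (assumed) rule says. The point is that the decision is written down once and applied the same way to every record.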

2. Master Data And Reference Data Are Different

Some customers get confused about the difference between master data and reference data.  Reference data is something like the table of two-character state postal codes we use in the USA.  Such codes rarely change and are used by all your applications for consistency.  As we will see, master data is the result of combining two or more sources to get the “best” combined representation.

3. Master Data Is Operational

In the day-to-day operation of the business, Master Data is about reconciling the differences between two or more systems and having a single version that represents what someone in the company has defined as the “truth” (at least according to the way they use the data).  You wouldn’t need Master Data Management (MDM) if you changed your operations so that everyone was using the same source data, entering all new data related to the same entity the same way, and all data imported from outside sources conformed to your data quality and business rules.  In real life, few organizations are prepared to go back and change their operations to meet these criteria, so you are resigned to Master Data Management to address the need for a consistent view of the items you master.

4. Data Quality Precedes MDM

The process of matching disparate sources of data to identify master entities is enhanced significantly when the source data is cleansed and standardized prior to applying matching rules.   Strike that – cleaning up and standardizing the data is required before attempting to do MDM.
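
To see why cleansing has to come first, consider the same customer entered two slightly different ways. This is a minimal sketch, not a production data quality routine: the abbreviation table, the `standardize` and `is_match` helpers, and the sample records are all assumptions made for illustration.

```python
import re

# Hypothetical abbreviation table; real standardization uses much larger dictionaries.
ABBREVIATIONS = {
    "street": "st",
    "avenue": "ave",
    "suite": "ste",
    "incorporated": "inc",
    "corporation": "corp",
}


def standardize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace, and abbreviate common words."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


def is_match(a: dict, b: dict) -> bool:
    """Exact match on standardized name and address (a deliberately simple rule)."""
    return (
        standardize(a["name"]) == standardize(b["name"])
        and standardize(a["address"]) == standardize(b["address"])
    )


billing = {"name": "Acme, Incorporated", "address": "500 Main Street"}
shipping = {"name": "ACME Inc.", "address": "500  Main St."}

raw_match = billing == shipping            # compares the raw strings
clean_match = is_match(billing, shipping)  # compares the standardized strings
```

The raw comparison fails and the standardized one succeeds, even though nothing about the customer differs between the two systems.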

5. MDM Can Be A Step In A Process

Sometimes the output of an MDM process is used as a way to create a static Master File for input to other systems, or as input to a data warehouse or data mart.  In the latter case, we sometimes do some enhancement as well.  For example, combining attributes from multiple systems into a “wide” master record that has all the attributes from the source systems would be common for later analysis purposes.  Just to be clear, this augmentation is not MDM, it is enabled by MDM.  MDM is about the identification of an entity and clustering the source records that pertain to that entity.
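
The two steps can be kept separate in code as well. In the sketch below, `cluster_by_key` stands in for the MDM part (identifying the entity and grouping its source records) and `widen` stands in for the later enhancement. The field names, the match key, and the sample records are hypothetical, and the exact-key grouping is a simplification of real matching rules.

```python
from collections import defaultdict


def cluster_by_key(records, key_fields):
    """MDM step: group source records that share the same normalized match key."""
    clusters = defaultdict(list)
    for rec in records:
        key = tuple(rec[field].strip().lower() for field in key_fields)
        clusters[key].append(rec)
    return list(clusters.values())


def widen(cluster):
    """Enhancement step: flatten one cluster into a wide record, prefixing by system.

    If a cluster holds two records from the same system, the later one overwrites.
    """
    wide = {}
    for rec in cluster:
        system = rec["system"]
        for field, value in rec.items():
            if field != "system":
                wide[f"{system}_{field}"] = value
    return wide


# Made-up source records from two systems.
source_records = [
    {"system": "billing", "name": "Acme Inc", "zip": "30301", "credit_limit": 50000},
    {"system": "shipping", "name": "acme inc ", "zip": "30301", "dock_hours": "8-5"},
    {"system": "billing", "name": "Globex", "zip": "10001", "credit_limit": 20000},
]

clusters = cluster_by_key(source_records, ["name", "zip"])
wide_records = [widen(cluster) for cluster in clusters]
```

Keeping the functions apart mirrors the distinction above: the wide record is only trustworthy because the clustering already decided which source records belong together.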

6. Operational MDM Can Be Performed Real Time In A Hub

An MDM Hub is a server that provides MDM services on request.   The same rules developed for data quality, entity identification, and the creation of a “master” record can be performed upon request by a server so that the resulting master record reflects the actual current state of the source systems.  The hub can be used to:

  • verify that an item already exists in one or more systems (potentially eliminating the entry of duplicates)
  • standardize the application of data quality processes
  • share operational “rules” that span systems and processes (if billing and shipping addresses are different, then verify shipping address before committing the order)
  • provide an administrative interface for human intervention and workflow for exception handling
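
A toy version of the first, third, and fourth uses might look like the sketch below. It is an in-memory stand-in built on assumptions: the `MdmHub` class, its method names, and the match key are invented for illustration, and a real hub would query the source systems rather than a Python dictionary.

```python
class MdmHub:
    """Toy in-memory stand-in for an MDM hub (hypothetical API)."""

    def __init__(self):
        self._masters = {}    # match key -> master record
        self.exceptions = []  # orders queued for human review

    @staticmethod
    def _key(name, zip_code):
        # The same standardization is applied to every request.
        return (" ".join(name.lower().split()), zip_code.strip())

    def register(self, name, zip_code):
        """Return (master_record, created); reuse the existing master if one matches."""
        key = self._key(name, zip_code)
        existing = self._masters.get(key)
        if existing is not None:
            return existing, False
        master = {"id": len(self._masters) + 1, "name": name, "zip": zip_code}
        self._masters[key] = master
        return master, True

    def check_order(self, order):
        """Shared rule: differing addresses need a verified shipping address."""
        differs = order["billing_address"] != order["shipping_address"]
        if differs and not order.get("shipping_verified", False):
            self.exceptions.append(order)  # hand off to the exception workflow
            return False
        return True


hub = MdmHub()
first, created_first = hub.register("Acme Inc", "30301")
second, created_second = hub.register("  ACME   inc", "30301 ")
order_ok = hub.check_order(
    {"billing_address": "PO Box 12", "shipping_address": "500 Main St"}
)
```

The second `register` call returns the existing master instead of creating a duplicate, and the order with mismatched addresses lands in the exception queue until someone verifies the shipping address.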

MDM processes and hubs need to be customized for every client situation.  It is an effort that has to involve the entire enterprise.

7. MDM As A Service

In the ideal future state, data quality enforcement and MDM could be standardized and provided as a service to all applications in the enterprise.  Centralizing these functions instead of performing them in different ways in each application could significantly reduce the amount of work that has to take place reconciling differences in the data.  Delivery via an enterprise cloud, or eventually as shared services in a public cloud, is a possibility.


SAP Announces Predictive Analytics Software

November 28, 2012

I just noticed that yesterday SAP announced the general availability of SAP Predictive Analysis software based on R and HANA.  I’ve been racking my brain trying to figure out why SAS or SPSS weren’t all over HANA given the incredible potential for real “real-time” analytics.  Now I know why.

I don’t know how robust the offering from SAP is at this point, but it can only get better as more and more analytical application developers adapt their offerings to R.

It remains to be seen how effective the SAP sales forces will be at “selling” analytics, especially since R is open-source.  I expect it will be more of a draw the other way, where R developers are interested in HANA as the underlying datastore for their applications.

The BI Cliff

November 28, 2012

I was traveling the last two weeks spending time with customers.  Many of them are thinking about where they are going to make their IT investments for next year.

A common theme is the need for self-service BI – pushing reporting and light analytics out of IT and into the hands of the business users.   I know what you are thinking – we’ve been doing this for years (though some might argue, not so successfully).   The difference now is that many IT shops have no choice.  They are facing a BI Cliff that will hurt their credibility and create friction with their business users.

Business users are clamoring for “big data” analysis and “real-time” analytics.  We used to queue up their report requests and set delivery expectations for weeks (if not months).  Now they expect to have all data immediately for visualization and modeling.  And all this has to be done without increasing the IT resources required to cater to their needs, such as:

  • Request/Workflow management
  • Report Writers
  • Data Stewards
  • Programmers
  • Dashboard Developers
  • Business Analysts
  • Data Administrators
  • ETL Programmers

In my consulting business, we work hard with the business users to help them to clearly identify, qualify, and prioritize their BI and analytic needs.  We put the onus on them to justify the business need and the investment required from their organization to do the reporting and analysis they think they need.   We help them to understand that they are responsible for data governance – that they own the data and must master what the data represents and how it can/cannot be used.    Only then do we begin the IT planning for new tools and capabilities to improve IT productivity or enable self-service BI.

At the end of the day, IT and Business are attached at the hip.  If IT gets pushed over the BI cliff, they will take the Business users with them.   And no one wants that.


November 8, 2012

SAP HANA seems to be getting some traction.  I happen to believe in the future of in-memory databases.  It only makes sense that we should try to balance the mismatch between processor speeds and the availability of data for processing.

I think SAP has some work ahead of them marketing HANA.  Here is a LIST of things I think they need to start doing immediately to boost their chances of making it a generally accepted database in SAP and non-SAP accounts.

Language – SAP needs to own the term in-memory database.  In-memory is being widely used by Oracle and others to refer to lots of things that happen with data, without fitting the requirements for being a truly in-memory database as defined here.  The problem is that in-memory database may quickly become so overused and misused that it becomes a useless marketing term like “big data”.  SAP needs to aggressively promote a strict definition for the label in-memory database and call out any vendor who misuses it.

Image – It’s hard for any database product to say it can do everything – we just don’t believe it anymore.  HANA needs to occupy one space in the mind of IT, and I believe that space is speed.  The advantages of data processing speed will apply regardless of where HANA is eventually applied, whether it is transaction processing or decision support, operational data store or data warehouse.  Everyone wants more data processing faster.  It is probably the most defensible position for HANA.

On a side note, I keep hearing HANA being pronounced as Hanna.  A woman’s name probably doesn’t evoke the image they are going for.  I would emphasize a pronunciation that rhymes with Katana – the deadly Japanese sword.

Sentiment – The Internet makes it possible to influence hundreds or thousands of potential HANA users negatively with a few keystrokes.  SAP is going to have to actively monitor and immediately respond to misinformation and requests for information about HANA in every blog and competitor release.  Interestingly enough, the name HANA isn’t that hard to monitor because it is fairly distinctive.

Technology – CIOs and other IT leaders play an unfair game of chess every time they put together their IT strategy.  It’s them against IBM, Oracle, Accenture, and every other technology vendor who is playing their own strategy to take IT’s money and time.  And the vendors don’t have to follow the rules – they can bypass the CIO and go directly to business executives to change the rules.

SAP is going to have to prove that HANA is on the side of the CIO and working to advance his or her game.  They have to be proactive in presenting why HANA should be a new piece on the board, and how HANA will provide long-term strategic advantage for both IT and the business.  Speed will get HANA in the door; a viable strategic plan will keep it in the game.

I’m sure there are other points I could bring up, but then I couldn’t use my cute acronym LIST.   Besides, SAP will be busy with just these four for years to come.

In-Memory Databases

October 31, 2012

Google Driverless Car

A couple of years ago I wrote an email to the CEO of Cloudera about why I was so excited about Hadoop and the potential for distributed processing (he never wrote back 😦 ).  When I studied computer science we talked a lot about how processors could be coordinated to do work, and I saw MapReduce as an interesting real-world application of distributed processing – although in this case the emphasis was on data distribution and retrieval.  I’m still a believer in what Hadoop and its offspring can do for managing and reporting on large volumes of transaction data.

I am even more excited about the potential for in-memory data storage.  There is a real mismatch between the speed at which today’s CPUs can process data and the speed at which they can get access to databases that reside on hard disk drives and even the new flash drives.  What happens when that bottleneck goes away?  What happens when the CPU is free to do its data processing 1000 times faster than it does today?

The folks in analytics talk about reducing batch jobs from hours to minutes.   They talk about increasing the complexity of jobs – with more data points in a time series or with more data from related data sources – without increasing the time required to do analysis.  This could be a real advantage for the corporate world.

I wonder what this means for even more interesting things like robotics and artificial intelligence.  Say, for example, that my automated vehicle has all the data related to every trip it’s made from my home to the grocery store available to match in real time with the conditions it senses (sees) ahead of it.  It “remembers” the blind driveway that my neighbor occasionally backs out of without looking, or the new oil they put down for a street repair last week that is slick in the morning dew.  Like a computerized chess master playing a board that is always changing, it can quickly run through 1,000 scenarios in near real-time, anticipating more moves and creating more contingencies than the typical human driver (especially the ones distracted by their smart phones).

We aren’t there yet.   In-memory databases are just now taking hold in the corporate world.  I wonder how soon this technology will trickle down and become affordable on desktops and mobile devices, and what kind of wondrous capabilities will be enabled?

4 Legged Sales Teams

October 26, 2012

Michael Stonebraker casts stones at what he calls the 4 legged sales team about 48 minutes into this video on NewSQL databases.

He describes the 4 legged sales team as the combination of an overpaid sales rep “who’s only smart enough to take a customer to lunch” connected to a technically proficient sales engineer.  He says his firm VoltDB is open source because they don’t want to waste money on an expensive field sales force.

I found his comments shortsighted, mostly because he denigrates field SEs – the good ones are truly amazing at getting customers on board with new technology.  He can say anything he wants to about sales people.  We are used to it.

I think he is confusing the “freemium” marketing model with open source.  In the freemium model, customers are enticed with a “free” version of the software with the intent of upgrading them to a paid/supported version later.   The freemium model has worked well for non-open source companies like   Open source and freemium worked well for Red Hat operating systems, and may be working well for the Hadoop vendors Cloudera and Hortonworks – though it’s still early to see how the Hadoop market develops.

However, whether it is freemium or open source or both, a company still eventually needs a high quality field sales force to sell enterprise customers on non-commodity technology.  Even, the poster child for the freemium model, has an enterprise field sales force.   A really good account executive becomes a trusted consultative resource to his or her account.   He personally knows who is for and against his technology, how to navigate the buying process at the account, and how to communicate clearly how his technology will provide a business advantage that justifies changing the way the enterprise does business.  Adding two more legs with a great SE who can focus on proving the technical value of the product or service results in more sales, bigger sales, faster sales cycles, and satisfied customers with solutions that can be proven to provide real business value.

No one can deny Stonebraker’s had some amazing business successes.  I wonder: if he truly valued sales, would he be the one buying a private island in Hawaii?

Governance Is Not (Necessarily) A Bad Thing

September 10, 2012

I was recently “enlightened” by a very experienced data warehouse practitioner that data governance is a bad thing.  For those of you who aren’t familiar with the concept, Wikipedia has the following definition:

governance relates to consistent management, cohesive policies, guidance, processes and decision-rights for a given area of responsibility. For example, managing at a corporate level might involve evolving policies on privacy, on internal investment, and on the use of data.

Now I have to admit his point.   He probably knows 10 times more about data management and processes than any data steward or data governance committee member responsible for establishing data governance policies.  When he goes to make changes to the way data is stored, processed, or used, he doesn’t have to think twice about what is best practice because he does it by nature.  The last thing he wants is some delayed process where he has to wait for approval from some data steward or committee.

I offer these observations:

  • Data governance is more for the 99 of 100 data users who don’t have his experience.  By comparison, all the security hassles we put up with as IT users are a productivity drain for those of us who are not a security risk, but we put up with them to protect corporate assets from the few who would put those assets at risk.
  • Good data governance policies and procedures should take into account the capabilities and qualities of data professionals like him.  In other words, the policies related to him should be less restrictive than they are for the typical user.  We can do this by creating classes of users with different authorizations and privileges.
  • We need his input and that of others like him in creating and maintaining a viable data governance program that serves its purpose without becoming another corporate bureaucracy that drains productivity.
  • If it’s broke, let’s fix it.  The alternative to data governance is data anarchy, resulting in poor decisions, bad investments, and lost opportunities.  Governance isn’t a bad thing.  Bad governance is a bad thing.  Good governance can be a good thing.
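
The idea of user classes with different privileges can be sketched in a few lines. Everything here is an assumption made for illustration: the class names, the actions, and the thresholds in the policy table are hypothetical, not a recommended governance model.

```python
from enum import Enum


class UserClass(Enum):
    """Hypothetical classes of data users, ordered by trust level."""
    CASUAL = 1
    ANALYST = 2
    DATA_PROFESSIONAL = 3


# Hypothetical policy: the lowest class allowed to act without steward approval.
POLICY = {
    "read": UserClass.CASUAL,
    "export": UserClass.ANALYST,
    "alter_schema": UserClass.DATA_PROFESSIONAL,
}


def needs_approval(user_class: UserClass, action: str) -> bool:
    """True when this class of user must wait for approval before the action."""
    return user_class.value < POLICY[action].value


expert_waits = needs_approval(UserClass.DATA_PROFESSIONAL, "alter_schema")
casual_waits = needs_approval(UserClass.CASUAL, "alter_schema")
```

Under this table the experienced practitioner changes a schema without waiting on a committee, while the typical user still goes through review.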

He also told me I was biased because I sell data governance consulting.   Guilty as charged.
