Big Data, Big Hats

This blog was first published in 2012 and is one of the most widely quoted of all my efforts. As you will see, it was a lot of fun to write - I hope it is as much fun to read.  I've updated it slightly to take out links that no longer work, but the essence remains.

Is Big Data just another Marketing wheeze and does it suffer from the same semantic issues that have bedevilled other MI/BI/DW oversells of the past 20 years?

This week I thought I’d share my musings on Big Data.  Is there any more to Big Data than marketing hype and another re-packaging of the “Better information is a competitive advantage” sales message for the Data Warehouse/ETL/BI/Data Mining companies?

In favour of the proposition that this is a real paradigm shift is the fact that there is an awful lot more data around now than there was. As I have reported here in the past, many telcos were de-tuning or switching off entirely some of their monitoring platforms due to the sheer volume of data and their inability to store it, never mind process it meaningfully.  We also have the social media phenomenon – every click might tell us something if only we could store and analyse it successfully. So we really do have the possibility to deep dive in an ocean of data when previously we were only paddling in a few (quite large) lakes.

On the other hand, the promises look awfully familiar and, as with any other new buzzword in our industry, the various marketers are already trying to extend their hegemony over this new terrain, to the confusion of us bystanders.  But at bottom does Big Data, for all its whizz-bang impressiveness, not founder on the same rock of epistemological imprecision that has bedevilled so many initiatives, from the first Codd and Date normalised data sets based on set theory (1970s), through the endless wastes of corporate data models (1980s), to the Data Warehouses (1990s) and the BI tools (2000s) of today?

And here I have an entertaining story. A very good friend of mine and I were at the recent Bloor/DQPro event in London chewing the fat generally and discussing targeted mailings and social networking sites in particular.  He told me that he had been looking to buy a fur hat for the wife of a friend. Now my pal is a gun enthusiast and I guess eviscerated dead animals as headgear are all the rage in those circles, not to mention being useful in keeping your head warm on those cold watches of the early morning when you are out stalking. Not that I would know; my use of weaponry is limited to the occasional visit to a pot-shot stand at the local fair.  Anyway, he has since been swamped with invitations to go and view various web sites that specialise in offal as apparel.  Our joint favourite is one where a slightly constipated-looking model attempts to appear macho in a variety of faintly comical carcasses.

Of course, if you check it out, you too may find that some canonical model will have you catalogued as a “Bestial Necrophiliac (sub-category Headgear)”. You have been warned.

And herein lies our problem. Canonical models.  At base, you have to believe that in the messy world of reality there are such things.  That the world really is ordered to some schema or set of schemas that, with just the right amount of intellect (embodied in a particular application) and enough data (well, as I say, we’ve gone from databases to data warehouses and now to Hadoop arrays), we can analyse and infer. But what if we can’t? What if they do not exist except in limited, specific applications where the benefit of forgoing richness is repaid in the economy of process?

I will give you two examples.  Back in the mid-90s I was working for a magazine publishing company who were one of the first, to my knowledge, to successfully utilise Data Warehousing technology for profit. They published and marketed an awful lot of titles which, in the UK, went through 80 or so warehouses to nearly one hundred thousand retail outlets. Some titles were monthly, some quarterly, some weekly.  Some were general interest, some were specialist.  Some had a seasonal variance in take-up (like gardening or wedding titles) and so on. Lots of data then, and a considerable waste in returns of inappropriately allocated magazines, not to mention the cost of manual allocation techniques.  By harnessing Data Warehousing they were able to dictate which title went to which retail store (“Boxing Out” in the parlance of the trade), in consequence of which they could renegotiate their relationship with the physical warehouse middlemen to the extent that the savings paid for the operating costs of the company for the next five years. And that after development costs.

But here the company concerned was prepared to accept the hit of misapplication of title to outlet on occasion.  The early cycles were replete with tales of small urban newsagents getting 500 copies of “Your Tractor Monthly” or some such.

Next there are my friends at IBM who are going to share a canonical banking model with me that I will hopefully blog about in the near future.

Again with this example (and I haven’t seen it so I can’t comment too much) what we have is a well understood domain – banking – and the application of a model that, even if it isn’t quite one-size-fits-all, will handle most well-understood features of banking. Again the benefit of possible application and data sharing might outweigh the cost of being straitjacketed into a single set of definitions.

Indeed, in technology we all do this all the time. We routinely design applications that are a compromise between open flexible space and constrained common process (normally with an emphasis on the latter).

However, in the real world things just ain’t like that. The real world is complex and messy. We, in moments of idleness, click on the most obscure hyperlinks. We have diverse and even contradictory habits and whims.  We structure the world not according to some group canonical model, but according to our own idiosyncratic framing of reality. Some of these frames we share with others, but again, an inconsistent and shifting group of others. Some are uniquely our own. And they change from day to day. I pose the question once again – is the whole edifice of predictive analysis at a personal level not based on a mistaken view of how individuals go about the business of understanding and constructing their interests? Is it not epistemologically challenged? Is this why it is so much more pleasant to wander through a large bookshop like Foyles in London than Amazon on the web? Although Amazon is undoubtedly more efficient at one level, in Foyles I may think I am going in to buy a book on Big Data but come out an hour later with a biography of Winston Churchill and a book of poetry. Engaging in the process of procurement at Foyles alters the construction of my choice set. This happens to all of us all the time. Do any of us have fixed canonical constructions of our reality?

Will the capacity to analyse more data help businesses uncover these transient understandings and so replicate the experience of serendipity that the physical experience provides us with? Or will we continue to get told about what other people who looked at product A also looked at? Will one visit to the fur hat site forever condemn us to invitations to fur hat nirvana?  Artificial intelligence is not my field but the folks I speak to in this area seem pretty wedded to an externalist epistemology – i.e. there are ontologies in the real world that are independent of individual cognition and Big Data will help us uncover more of these models. Now whilst I am pretty much undecided on this one, I am quite sure that when it comes to individual decision making it is all about an individually constructed perception, especially where the decision is relatively trivial (shall I buy my best friend’s wife a fur hat or a handbag for her 50th birthday?)

This is not to say that on a scaled-up basis it won’t help all of us in those more closed-paradigm situations we looked at earlier. More data, better collected (and the data lineage of all this data is another question altogether) and better analysed, will lead to better decision making in areas where we can be sure of the model, and will undoubtedly lead to competitive advantage for those who get it right. It will help us spot trends in data on the direction the herd is headed, but I’m not so sure about predicting the movements of the individual.

I would write more but I’ve just received an email for a very interesting range of fox fur over-trousers which I really must investigate…

Johny Morris