Data is not like oil – it is much more interesting than that

So, this may seem to be a nitpicking little note, but it is not intended to belittle anyone or even to deny the importance of having a robust and rigorous discussion about data, artificial intelligence and the future. Quite the contrary – this may be one of the most important discussions that we need to engage in over the coming ten years or so. But when we do so our metaphors matter. The images that we convey matter.

Philosopher Ludwig Wittgenstein notes in his works that we are often held hostage by our images, that they govern the way we think. There is nothing strange or surprising about this: we are biological creatures brought up in three-dimensional space, and our cognition did not come from the inside, but it came from the world around us. Our figures of thought are inspired by the world and they carry a lot of unspoken assumptions and conclusions.

There is a simple and classical example here. Imagine that you are discussing the meaning of life, and that you picture the meaning of something as hidden, like a portrait behind a curtain – and that discovering the meaning then naturally means revealing what is behind that curtain and how to understand it. Now, the person you are discussing it with instead pictures it as a bucket you need to fill with wonderful things, and that meaning means having a full bucket. You can learn a lot from each other’s images here. But they represent two very different _models_ of reality. And models matter.

That is why we need to talk about the meme that “data is like oil” or any other scarce resource, like the spice in Dune (with the accompanying cry “he who controls the data…!”). This image is not worthless. It tells us there is value to data, and that data can be extracted from the world around us – so far the image is actually quite balanced. There is value in oil and it is extracted from the world around us.

But the key thing about oil is that there is not a growing amount of it. That is why we discuss “peak oil” and that is why the control over oil/gold/Dune spice is such a key thing for an analysis of power. Oil is scarce, data is not – at least not in the same way (we will come back to this).

Still not sure? Let’s do a little exercise. In the time it has taken you to read to this place in the text, how many new dinosaurs have died and decomposed and been turned into oil? Absolutely, unequivocally zero dinosaurs. Now, ask yourself: was any new data produced in the same time? Yes, tons. And at an accelerating rate as well! Not only is data not scarce, it is not-scarce in an accelerating way.

Ok, so I would say that, wouldn’t I? Working for Google, I want to make data seem innocent and unimportant while we secretly amass a lot of it. Right? Nope. I do not deny that there is power involved in being able to organize data, and neither do I deny the importance of understanding data as a key element of the economy. But I would like for us to try to really understand it and then draw our conclusions.

Here are a few things that I do not know the answers to, and that I think are important components in understanding the role data plays.

When we classify something as data, it needs to be unambiguous, and so needs to be related to some kind of information structure. In the old analysis we worked with a model where we had data, information, knowledge and wisdom – and essentially thought of that model as hierarchically organized. That makes absolutely no sense when you start looking at the heterarchical nature of how data, information and knowledge interact (I am leaving wisdom aside, since I am not sure whether that is a correct unit of analysis). So something is data in virtue of actually having a relationship with something else. Data may well not be an _atomic_ concept, but rather a relational concept. Perhaps the basic form of data is the conjunction? The logical analysis of data is still fuzzy to me, and seems to be important when we live in a noise society – since the very first step we need to undertake is to mine data from the increasing noise around us. Here we may discover another insight: data may become increasingly scarce since it needs to be filtered from noise, and the cost of that may be growing. That scarcity is quite different from the one where there is only a limited amount of something – and the key to value here is the ability to filter.

Much of the value of data lies in its predictive qualities: it can be used to predict and analyze in different ways. But that value clearly is not stable over time. So if we think about the value of data, should we then think in terms of a kind of decomposing value that disappears over time? In other words: do data rot? One of the assumptions we frequently make is that more data means better models, but that also seems to be blatantly wrong. As Taleb and others have argued, when the number of variables in a data set grows linearly, the number of correlations grows far faster than that (quadratically for pairwise correlations alone), and an increasing percentage of those correlations are spurious and worthless. That seems to mean that if big data is good, vast data is useless and needs to be reduced to big data again in order to be valuable at all. Are there breaking points here? Certainly there should be from a cost perspective: when the cost C of reducing a vast data set to a big data set is greater than the expected benefits in the big data set, then the insights available are simply not worth the noise filtering required. And what of time? What if the time it takes to reduce a vast data set to a big data set necessarily is such that the data have decomposed and the value is gone? Our assumption that things get better with more data seems to be open to questioning – and this is not great. We had hoped that data would help us solve the problem.
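The point about correlations multiplying as variables are added can be made concrete with a small experiment. What follows is a minimal sketch and not anyone's published method: the function names, the sample size of 30, and the 0.4 threshold are all arbitrary assumptions chosen for illustration. Every variable is independent random noise, so any pair that clears the threshold is spurious by construction.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(num_variables, num_samples=30, threshold=0.4, seed=0):
    """Return (pairs, spurious) for independent Gaussian noise variables.

    num_samples and threshold are illustrative assumptions, not values
    taken from the text. Because the variables are unrelated noise, every
    pair with |r| above the threshold is a spurious correlation.
    """
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(num_samples)]
               for _ in range(num_variables)]
    pairs = 0
    spurious = 0
    for a, b in itertools.combinations(columns, 2):
        pairs += 1
        if abs(pearson(a, b)) > threshold:
            spurious += 1
    return pairs, spurious


if __name__ == "__main__":
    # Doubling the variables roughly quadruples the number of pairs.
    for k in (10, 20, 40, 80):
        pairs, spurious = count_spurious(k)
        print(f"{k:3d} variables -> {pairs:5d} pairs, {spurious:4d} above threshold")
```

The number of pairs is k(k-1)/2, so it grows quadratically while the variables grow linearly, and one would expect the count of above-threshold pairs to rise roughly in step with it. The sketch only illustrates the growth in the absolute number of spurious hits among pure noise; it says nothing about how the share of spurious correlations behaves in real data, which is the harder empirical question.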

AlphaGo Zero seems to manage without seed data sets, or at least without human game data. What is the class of tasks such that they actually don’t benefit from seed data? If that class is large, what else can we say about it? Are crucial tasks in that set? What characterizes these tasks? And are “data agnostic” tasks evidence that we have vastly overestimated the nature and value of data for artificial intelligence? The standard narrative now is this: “the actor that controls the data will have an advantage in artificial intelligence and then be able to collect more data in a self-reinforcing network effect”. This seems to be nonsense when we look at the data agnostic tasks – how do we understand this?

One image that we could use is to say that models eat data. Humor me. Metabolism as a model is more interesting than we usually allow for. If that is the case we can see another way in which data could be valuable: it may be more or less nutritious – i.e. it may strengthen a model more or less if the data we look at becomes part of its diet. That allows us to ask complicated questions like this: if we compare an ecology in which models get to eat all kinds of data (i.e. an unregulated market) and ecologies in which the diet is restricted (a regulated market) and then we let both these evolved models compete in a diet-restricted ecology – does the model that grew up on an unrestricted diet then have an insurmountable evolutionary advantage? Why would anyone be interested in that, you may ask. Well, we are living through this very example right now – with Europe a regulated (often soundly regulated) market and key alternative markets completely unregulated – with the very likely outcome that we will see models that grew up on unregulated markets compete with those that grew up in Europe, in Europe. How will that play out? It is not inconceivable that the diet-restricted ones will win, by the way. That is an empirical question.

So, finally – a plea. Let’s recognize that we need to move beyond the idea that data is like oil. It limits our necessary and important public debate. It hampers us and does not help us understand this new complex system. And this is a wide open field, where we have more questions than answers right now – and we should not let faulty answers distract us. And yes, I recognize that this may be a fool’s plea, since the image of data as oil is so strong and alluring, but I would not be the optimist I am if I did not think we could get to a better understanding of the issues here.

1 thought on “Data is not like oil – it is much more interesting than that”

  1. Interesting article on the epistemology of data. But even if we take the metaphor of data as crude oil, we need to put it into a refinery in order to produce a variety of streams that we can actually use in practice. Some streams we’ll have to feed into a cracker, to break them up even further. Then comes the magic of reassembling the small chunks (monomers in my business) into stuff that didn’t exist before (polymers).
