It is an unpleasant experience to take a bite from a sweet-looking apple, only to spit it right back out after realizing the apple is rotting from the inside. Your expectation of a scrumptious apple is shattered. This is something you would probably like to avoid. This blog, however, is not about apples or food. This blog is about other things that can start to rot over time, without you realizing it. This blog is about rotting data.
With reference to the image above, would you use data like this? The answer is not as clear-cut as with the apple. However, the question to ask yourself is the same as with the apple: how old is it? If we do not know the answer to this question, the data should not be used anymore, because data can rot, just like an apple.
What does rotting (data) mean?
When you look at it from a natural perspective, it means that an object, for example an apple, is decaying, making it unsuitable for consumption. The reason for decay is that an apple consists of organic material. Decay in organic materials is influenced by air circulation, temperature, humidity and chemical composition. Other materials also decay, not due to bacteria but because of oxidation. Take iron surfaces, for example. Iron reacts with oxygen in the air, resulting in rust. The material will become weaker and it will ultimately crumble away completely. From that perspective, "rotting" is a bit of a misnomer when applied to data. I am still using it because it sends a clear message: data can be outdated and therefore should not be used anymore.
What are examples of outdated (rotten) data?
You might think that data cannot rot once it has been entered into a system. And yes, there is no physical decay because data will stay the same, and in some cases will stay relevant. Some data is even valid forever, for example your place and date of birth. Another example: if you have a database of round dining room tables, the diameter, radius, and area (A = π × r²) for each table will always stay the same. But not all data is that stable, and some data can rot. Examples of data whose validity varies, and which can therefore rot, are the number of tables in stock and the price of a table.
I am using the word ‘rot’ because no one in their right mind will eat a rotting apple. I believe that you should not use rotting data either.
Variable customer information
If you are collecting customer data, such as names and addresses, this data might be subject to change. People move. Customers leave. People die. A list of a thousand customers gradually becomes less reliable.
Time related data can decay from the moment it is stored
So, data rot has to do with certain types of data that can get outdated for a multitude of reasons. But the way that data rots is different from the way an apple rots. Time is one of the most important factors. Data can already start to decay from the moment it is stored.
Working with outdated data
Let’s say you have a business with 1,000 customers in your database. For marketing purposes, your marketing assistant makes an export of your customer database to target customers with personalized mailings about new products, insights, etc. It’s a lot of work to export and import personal details into the CRM system, so your marketing assistant does not update this export regularly. To what extent will that original export of 1,000 customers still be valid after six months? Will you have the same 1,000 customers after six months, or will you have more, or fewer? If you have more, you are missing opportunities. If you have fewer, you are targeting people who are no longer your clients. And if you have different customers, for example you still have 1,000 customers but 20% of them are new, then again, you are not hitting the mark.
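To make the six-month scenario concrete, here is a minimal sketch that compares an old export with the current customer base. It assumes every customer has a stable ID; the function name `compare_export` and the ID ranges are hypothetical and only serve the illustration.

```python
def compare_export(exported_ids, current_ids):
    """Compare an old export with the current customer base (by customer ID)."""
    exported, current = set(exported_ids), set(current_ids)
    return {
        # Customers in the export who are still customers today.
        "still_valid": exported & current,
        # Customers in the export who have since left: wasted mailings.
        "no_longer_customers": exported - current,
        # Current customers the export does not know about: missed opportunities.
        "missing_from_export": current - exported,
    }


# Hypothetical numbers: 1,000 customers were exported; six months later
# 200 of them have left and 200 new customers have joined.
exported = range(1, 1001)
current = range(201, 1201)

report = compare_export(exported, current)
for label, ids in report.items():
    print(label, len(ids))
```

Even though the total is still 1,000 customers, the export is wrong in two directions at once, which is exactly why the head count alone tells you little about how fresh the export is.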
How can you prevent your data from becoming outdated?
There are many more examples of data that has the potential to rot. I would suggest that you look at all your data to see where data rot could occur. The more an organic product is exposed to air and the warmer the temperature is, the faster the rotting process will take place. What are the rotting criteria for your data? What makes your data rot way faster than you imagined? And what can you do about this?
Store and use it in the right way
An easy way to keep milk fresh for longer is to put it in a fridge. Data, of course, is not an organic product. Data is something that you need to keep up to date for it to be reliable to use. But you also need to store and use it in the right way! By this I mean that you must look at your data to see whether it’s still accurate, but you must also look at whether it (still) fits the purpose that you are using it for. Storing someone’s age as 42 years does not make sense, because it is bound to become outdated unless you also record the date of the observation. However, if you instead store a person’s date of birth, the age can be derived whenever it is needed, and this will keep your data fresh and prevent it from rotting.
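Deriving the age from a stored date of birth could look like the sketch below. The function name `age_on` and the example dates are my own assumptions for illustration; the point is only that the stored value never changes while the derived value stays correct.

```python
from datetime import date


def age_on(date_of_birth: date, today: date) -> int:
    """Return the age in whole years on the given day."""
    years = today.year - date_of_birth.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


# Hypothetical person, born on 15 June 1980.
born = date(1980, 6, 15)
print(age_on(born, date(2023, 6, 14)))  # the day before the birthday
print(age_on(born, date(2023, 6, 15)))  # on the birthday
print(age_on(born, date.today()))       # always current, nothing to update
```

A stored "42" would have been wrong from the next birthday onwards; the stored date of birth never needs updating.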
One unified model?
Things can get more complex when data is being stored in different places and in different forms. There is no one big data model that keeps your data freshly stored for all connecting environments. And it’s a good thing that this does not exist, because a unifying model that connects all data in a usable way is, as far as I am concerned, part holy grail, part Fata Morgana. In other words, it would be great if we had it, but it really is not feasible from the perspective of costs and complexity. Nor is it feasible from the perspective of agreeing on data and its meaning, as people can look at the same data and perceive it in different ways. Take for instance a bank account. It is possible to have an overdraft on a checking account (a way for banks to earn money) but not on a savings account.
The value of historical data
Old data does not necessarily have to be rotten. Historical data can be analyzed to detect patterns. An example is the seasonal influence on sales. Although yesterday’s data doesn’t necessarily say much about today, it perhaps does say something about the same day next week. Old data is not necessarily worthless, but it is also not worth its weight in gold. For each situation, you need to determine the potential of the data and validate it with new data.
Incomplete data means missed opportunities
Customers can register themselves as single when they order something from your webshop, but how do you know for sure that they are still single? Perhaps you can infer this from the orders that they made, but that is not a reliable way of doing it. The result is missing data, and this can translate into missed opportunities.
Manage your data
Keeping your data fresh starts with the way you store data. A well-thought-out data model will make sure that there is no redundancy in data storage, that there is an agreed Entity Relationship Diagram or Domain Model, and that the tables and fields being used can actually store the data. This seems logical, but there are situations where the same data can come in different formats, for example US vs European dates. Client addresses and zip codes can also differ between countries.
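The US vs European date problem is easy to demonstrate. The sketch below uses a made-up input string and Python's standard `datetime.strptime`; storing the result in ISO 8601 form is one common way to remove the ambiguity, not the only one.

```python
from datetime import datetime

# Hypothetical raw value as a customer might type it into a form.
raw = "03/04/2024"

# The same text is a valid date under both conventions.
as_us = datetime.strptime(raw, "%m/%d/%Y").date()  # month first
as_eu = datetime.strptime(raw, "%d/%m/%Y").date()  # day first

# ISO 8601 (year-month-day) is unambiguous once stored.
print(as_us.isoformat())  # 2024-03-04
print(as_eu.isoformat())  # 2024-04-03
```

Both interpretations parse without an error, so the system cannot detect the mistake on its own. The convention has to be agreed on in the data model and applied at the moment the data is entered.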
Who manages the data?
Are you the main source of the data stored, or is the data coming from another organization? If another organization is the source, can they periodically supply you with updates? It is important that you make sure that the data you store is up to date. Is there an API that you can use to update the data?
One way of doing this is by letting customers update their own customer data. You should make this easy for them by building in all the help you can imagine, for example dropdown lists for cities.
Validation also helps to keep the data fresh and correct, for instance with the PostNL API, which helps check address data.
Uncertainty grows when data ages
Finally, keep it fresh. Do not work with the same extract for a long time. Instead, make a new extract each time the data is needed, or provide an interface to that data. Such interfaces can be APIs, GraphQL schemas and so on. You can even work with algorithms to analyze data, but be careful with the privacy rules. Consent is always a good idea in this day and age. Look for professional support if needed.
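If you cannot avoid working with extracts, you can at least record when each extract was made and refuse to use it once it gets too old. The following is a minimal sketch of that idea; the `Extract` class, the `is_fresh` function and the 30-day limit are assumptions of mine, and the right maximum age depends entirely on how fast your data rots.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Extract:
    name: str
    # None means nobody recorded when the extract was made.
    extracted_at: Optional[datetime]


def is_fresh(extract: Extract, max_age: timedelta, now: datetime) -> bool:
    """Return True only if the extract's age is known and within max_age."""
    if extract.extracted_at is None:
        # Unknown age: treat it like the apple you cannot inspect.
        return False
    return now - extract.extracted_at <= max_age


# Hypothetical extracts, checked on 1 July 2024 with a 30-day limit.
now = datetime(2024, 7, 1)
limit = timedelta(days=30)

old = Extract("customers_january", datetime(2024, 1, 1))
recent = Extract("customers_june", datetime(2024, 6, 20))
unknown = Extract("customers_copy_final", None)

for extract in (old, recent, unknown):
    print(extract.name, is_fresh(extract, limit, now))
```

The extract with an unknown date is rejected on purpose: if you do not know how old the data is, you should not use it.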
The most important takeaway is perhaps best demonstrated with an example: if I eat an apple, I always look at it before I take a bite. The same rule applies when working with data: look at it before you dive in.
Thanks to Prof Dr Martin Kersten (CWI) for the original idea on data fungus.