
How to master data

Hans Bot
Senior Solution Architect

Master data management is important, and its importance is increasing rapidly. That’s what many CIOs tell me. Its formal definition, “data about the business entities providing context for business transactions”, seems almost innocent. But when you think it through from a data architecture perspective, these are the data objects that are used throughout all kinds of applications. Data you need in multiple domains. Often data without a clear owner or a unified definition. Often without a single source. Yet you have to guarantee consistency. It involves data objects with a broken object history. The kind of data where different properties are relevant in different domains. And the lifetime of a single master data object may span the lifetime of multiple master data management solutions.

I know, managing master data can be challenging indeed.

Authoritative references

Some master data is actually pretty generic, such as the countries of the world, their currencies, languages, and so on. There’s always debate about the content, but at least there is an international reference you can rely on. Some master data is more specific, for instance international patent information, or a list of all licensed banks and their SWIFT codes. Even the internet domain name system is a type of managed master data. Additionally, many registries are available at country level, such as those for citizens, businesses, and healthcare providers. Nice if you’re active in one country only, but a hassle when your business spans multiple countries. Whether you like it or not, in a world where we still struggle to agree on a universal standard format for dates, or on units for temperature, weight, and length, organizations are still pretty much on their own when it comes to their master data management.

Fragile approaches

Over time, there have been many approaches to tackle master data management for the enterprise. Sharing database tables across applications was probably the first attempt. A centralized service, often implemented on top of a master data management solution, is a more recent one. Both are notoriously difficult to integrate. If you’re using any off-the-shelf application, say a warehouse management system, you cannot expect it to connect to your custom master data tables or to consume your custom services. So, you create a copy of the relevant data and implement a process to keep the copy updated with changes in the master. But then the replica gets updated locally, and you’re out of sync. Or an item gets deleted in the master, while it’s still being referenced in the replica. Or the database schema gets updated. Or you upgrade to a new version of your master data management tool.

Believe me, it’s fragile.

Difficult to integrate

Off-the-shelf master data management applications are versatile: you can use them to manage any kind of master data. As a consequence, their database schemas and interface definitions tend to be very abstract. As an example, to update a customer object, you have to update a master data item of type customer, and to update a product, you update a master data item of type product. Obviously, while the interface is the same, the behavior is quite different. The customer has moved, or married, or died, while the product has a new supplier or is out of stock. All through one interface. Since the behavior is programmed in configuration tables, a simple configuration change can have a drastic impact on the behavior, breaking existing integrations without any visible, manageable change to the interface definition.

Believe me, it’s messy.

A modern approach

Today, event-based architectures are all the rage. Here, you can track the entire life cycle of any master data object as a sequence of events. Events can originate from anywhere; in other words, there is no single owner. Moreover, there is no master deciding or approving the current state of an object. It’s a distributed system, where every endpoint keeps its own version of the master data. There is only a shared event registry that every data source has to contribute to. In a way, it is a middleware approach to master data management. Keep the intelligence in the applications, allow for heterogeneity, just make sure you feed them all the data they need.
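To make the idea of "a life cycle as a sequence of events" concrete, here is a minimal sketch in Python. The event shape (`object_id`, `kind`, `changes`, `source`) and the `replay` function are assumptions made for illustration only; they are not part of any product mentioned in this post. Each endpoint could run something like `replay` over the shared event registry to derive its own current view.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class MasterDataEvent:
    """Hypothetical event shape; field names are illustrative assumptions."""
    object_id: str            # identity of the master data object
    kind: str                 # "created", "updated" or "deleted"
    changes: dict[str, Any]   # properties set by this event
    source: str               # system that published the event


def replay(events: list[MasterDataEvent]) -> dict[str, dict[str, Any]]:
    """Fold an ordered sequence of events into the current state per object."""
    state: dict[str, dict[str, Any]] = {}
    for event in events:
        if event.kind == "deleted":
            state.pop(event.object_id, None)
        else:
            # Later events overwrite earlier values of the same property.
            state.setdefault(event.object_id, {}).update(event.changes)
    return state


# Events for the same object may come from different sources: no single owner.
registry = [
    MasterDataEvent("cust-1", "created", {"name": "Ada", "city": "Utrecht"}, "crm"),
    MasterDataEvent("cust-1", "updated", {"city": "Delft"}, "webshop"),
    MasterDataEvent("cust-2", "created", {"name": "Bob", "city": "Leiden"}, "crm"),
    MasterDataEvent("cust-2", "deleted", {}, "crm"),
]

current = replay(registry)
print(current)
```

Because the state is derived from the events, an endpoint that only cares about a subset of the properties can apply its own fold over the same registry.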

I have personally experienced that this approach can be very fruitful indeed.

But how, exactly?

As said, this can only work as an integration solution. Since we’re processing streams of events, a different stream for every type of master data, it’s a matter of publishing, transporting, enriching, filtering, transforming and consuming events. In some cases, you have to aggregate data from multiple streams. That’s it. It may sound complicated, but in fact it is quite straightforward.


There are basically two ways to publish an event from a data source to a stream. In a (micro)service architecture, you may opt to actively publish an event to an input topic on an event broker such as Solace, Kafka or NATS. Just make sure to do so with every change of the shared data. This way it is pretty easy to ensure that changes in the implementation will be reflected in the publication of events. But this only works in the services you’ve custom built and possibly for some modern cloud-based solutions. For legacy applications and off-the-shelf software, you obviously need an adapter. This is how to create such an adapter.
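As a rough illustration of the first option, the sketch below routes every change of the shared data through a single write path that also publishes the event. `InMemoryBroker` is a stand-in written for this example, not the client API of Solace, Kafka or NATS; the service, the topic name and the event fields are likewise hypothetical.

```python
import json
from collections import defaultdict


class InMemoryBroker:
    """Stand-in for a real event broker; keeps published messages per topic."""

    def __init__(self) -> None:
        self.topics: dict[str, list[str]] = defaultdict(list)

    def publish(self, topic: str, payload: str) -> None:
        self.topics[topic].append(payload)


class CustomerService:
    """Hypothetical service that owns customer data and publishes each change."""

    def __init__(self, broker: InMemoryBroker,
                 topic: str = "masterdata.customer.input") -> None:
        self._broker = broker
        self._topic = topic
        self._customers: dict[str, dict] = {}

    def _save(self, customer_id: str, changes: dict, kind: str) -> None:
        # Single write path: the state change and the event publication
        # live together, so no public method can change data silently.
        self._customers.setdefault(customer_id, {}).update(changes)
        event = {"id": customer_id, "kind": kind, "changes": changes}
        self._broker.publish(self._topic, json.dumps(event, sort_keys=True))

    def register(self, customer_id: str, name: str, city: str) -> None:
        self._save(customer_id, {"name": name, "city": city}, "created")

    def move(self, customer_id: str, city: str) -> None:
        self._save(customer_id, {"city": city}, "updated")


broker = InMemoryBroker()
service = CustomerService(broker)
service.register("cust-1", "Ada", "Utrecht")
service.move("cust-1", "Delft")

for message in broker.topics["masterdata.customer.input"]:
    print(message)
```

In a real service the local write and the publication would also need to succeed or fail together, which this sketch does not attempt to solve.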

WSO2 Streaming Integrator comes with a feature called “Change Data Capture”. It allows you to actively listen to your database transactions, and capture changes in real time. You probably guessed it, once captured, you can simply publish those to the relevant topic on your event broker. Problem solved. Just make sure this adapter is managed by the team responsible for the data source.
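To show the shape of such an adapter without depending on any product, here is a simplified sketch using Python's built-in `sqlite3` module. It is not WSO2 Streaming Integrator's Change Data Capture feature: it swaps in a plainer technique, where database triggers append to a change table and the adapter polls that table from a remembered offset. The table names, columns and the `capture_changes` function are assumptions for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id TEXT PRIMARY KEY, name TEXT, supplier TEXT);
CREATE TABLE change_log (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, id TEXT, name TEXT, supplier TEXT
);
CREATE TRIGGER product_ins AFTER INSERT ON product BEGIN
    INSERT INTO change_log (op, id, name, supplier)
    VALUES ('insert', NEW.id, NEW.name, NEW.supplier);
END;
CREATE TRIGGER product_upd AFTER UPDATE ON product BEGIN
    INSERT INTO change_log (op, id, name, supplier)
    VALUES ('update', NEW.id, NEW.name, NEW.supplier);
END;
CREATE TRIGGER product_del AFTER DELETE ON product BEGIN
    INSERT INTO change_log (op, id, name, supplier)
    VALUES ('delete', OLD.id, OLD.name, OLD.supplier);
END;
""")


def capture_changes(connection, last_seq, publish):
    """Publish every change recorded after last_seq; return the new offset."""
    rows = connection.execute(
        "SELECT seq, op, id, name, supplier FROM change_log "
        "WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    for seq, op, pid, name, supplier in rows:
        publish(json.dumps(
            {"op": op, "id": pid, "name": name, "supplier": supplier}))
        last_seq = seq
    return last_seq


published = []  # stands in for the topic on the event broker

# The legacy application writes to its own table, unaware of the adapter.
conn.execute("INSERT INTO product VALUES ('p-1', 'Kettle', 'Acme')")
conn.execute("UPDATE product SET supplier = 'Globex' WHERE id = 'p-1'")
offset = capture_changes(conn, 0, published.append)

conn.execute("DELETE FROM product WHERE id = 'p-1'")
offset = capture_changes(conn, offset, published.append)

print(offset, published)
```

The point of the design is that the source application stays untouched: the adapter only observes what the database records, and the offset lets it resume without publishing a change twice.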

Each stream may have one or more input sources connected to it. Even if you have just one input system at the moment, your architecture should allow for multiple inputs, if only to support a future migration to a new input system. That’s why you need to normalize your events by transforming them into a harmonized format and classification system. Normalization might include aggregation as well. Aggregation is often needed when the data in the source system is kept at a finer granularity than you want to publish to your output streams, for example when a single data object is stored across multiple tables.
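A minimal sketch of this normalization step might look as follows. The two source systems ("crm" and "webshop"), their field names and the country classification are all invented for the example. The webshop normalizer also shows the aggregation case, combining two finer-grained records into one harmonized object.

```python
# Harmonized classification, assumed for illustration: map the spellings
# used by the source systems onto ISO-style two-letter country codes.
COUNTRY_CODES = {
    "Netherlands": "NL", "Nederland": "NL", "NL": "NL",
    "Germany": "DE", "Deutschland": "DE", "DE": "DE",
}


def from_crm(raw: dict) -> dict:
    """Hypothetical CRM record: one flat row per customer."""
    return {
        "customer_id": f"crm:{raw['CustNo']}",
        "name": raw["FullName"].strip(),
        "country": COUNTRY_CODES[raw["Country"]],
    }


def from_webshop(raw: dict) -> dict:
    """Hypothetical webshop record: person and address are stored
    separately, so normalizing also aggregates them into one object."""
    person, address = raw["person"], raw["address"]
    return {
        "customer_id": f"webshop:{person['id']}",
        "name": f"{person['first']} {person['last']}",
        "country": COUNTRY_CODES[address["country"]],
    }


# One normalizer per input source; adding a source means adding an entry.
NORMALIZERS = {"crm": from_crm, "webshop": from_webshop}


def normalize(source: str, raw: dict) -> dict:
    return NORMALIZERS[source](raw)


a = normalize("crm", {"CustNo": 17, "FullName": " Ada Lovelace ",
                      "Country": "Netherlands"})
b = normalize("webshop", {"person": {"id": 5, "first": "Bob", "last": "Smit"},
                          "address": {"country": "Nederland"}})
print(a)
print(b)
```

Keeping one normalizer per source is what makes a later migration manageable: the new input system gets its own entry while the harmonized output format stays the same.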


I am well aware that every integration case is special. I also see a lot of recurring patterns. In this case, you probably want to start with a validation step to keep any garbage out of your event registry, before normalizing and storing the event. Next, you want to use Change Data Capture on your event store. Sure, you can also opt to process the event in a single flow, as long as you can make sure the database transaction has succeeded before you proceed. After all, you want to avoid having changes in your target data sources that are not traceable to events in your event registry.
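The single-flow variant can be sketched as below: validate, normalize, store in the event registry, and only then hand the event on. This is an illustrative sketch using Python's `sqlite3` as the event store; the required fields, the allowed kinds and the `ingest` function are assumptions, not a prescribed design.

```python
import json
import sqlite3

REQUIRED_FIELDS = ("id", "kind", "changes")
ALLOWED_KINDS = {"created", "updated", "deleted"}


def validate(event: dict) -> list[str]:
    """Return a list of problems; an empty list means the event is accepted."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in event]
    if "kind" in event and event["kind"] not in ALLOWED_KINDS:
        errors.append(f"unknown kind: {event['kind']}")
    return errors


def normalize(event: dict) -> dict:
    return {
        "id": str(event["id"]).strip().lower(),
        "kind": event["kind"],
        "changes": dict(event["changes"]),
    }


def ingest(conn: sqlite3.Connection, event: dict, forward) -> bool:
    """Validate, normalize and store an event; forward it only after commit."""
    if validate(event):
        return False  # garbage stays out of the registry
    normalized = normalize(event)
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT INTO event_registry (payload) VALUES (?)",
            (json.dumps(normalized, sort_keys=True),),
        )
    # Reached only when the transaction has been committed, so everything
    # downstream is traceable to a stored event.
    forward(normalized)
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_registry (seq INTEGER PRIMARY KEY, payload TEXT)")

forwarded = []
accepted = ingest(conn, {"id": " CUST-1 ", "kind": "created",
                         "changes": {"name": "Ada"}}, forwarded.append)
rejected = ingest(conn, {"id": "cust-2", "kind": "married"}, forwarded.append)
stored = conn.execute("SELECT COUNT(*) FROM event_registry").fetchone()[0]
print(accepted, rejected, stored, forwarded)
```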

Subsequent steps can cover a number of things, enrichment and aggregation being the regular ones. This is a different aggregation use case, to accommodate output streams that combine multiple business objects into a larger aggregate, e.g. a menu as a collection of recipes, a product as a collection of product variants, or a product category as a collection of products. If necessary, you can also do a number of calculations. In fact, there is an entire store with all sorts of extensions to choose from.
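As an illustration of this second kind of aggregation, the sketch below keeps the latest state of each product variant and emits a product-level aggregate whenever a variant changes. The class name, the event fields and the shape of the aggregate are assumptions made for the example; a streaming engine would typically provide its own constructs for this.

```python
from collections import defaultdict


class ProductAggregator:
    """Combines variant events into one aggregate event per product."""

    def __init__(self) -> None:
        # product_id -> {variant_id -> attributes}
        self._variants: dict[str, dict[str, dict]] = defaultdict(dict)

    def on_variant_event(self, event: dict) -> dict:
        variants = self._variants[event["product_id"]]
        if event["kind"] == "deleted":
            variants.pop(event["variant_id"], None)
        else:
            variants[event["variant_id"]] = event["attributes"]
        # Emit the complete product aggregate for the output stream.
        return {
            "product_id": event["product_id"],
            "variant_count": len(variants),
            "variants": [
                {"variant_id": vid, **attrs}
                for vid, attrs in sorted(variants.items())
            ],
        }


agg = ProductAggregator()
agg.on_variant_event({"product_id": "p-1", "variant_id": "v-2",
                      "kind": "updated", "attributes": {"color": "red"}})
both = agg.on_variant_event({"product_id": "p-1", "variant_id": "v-1",
                             "kind": "updated", "attributes": {"color": "blue"}})
latest = agg.on_variant_event({"product_id": "p-1", "variant_id": "v-2",
                               "kind": "deleted", "attributes": {}})
print(both)
print(latest)
```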

After processing, you publish your event to the output topic.


The last step is basically a mirror image of the first. You have one or more target data sources. Your (micro)services can simply subscribe to the relevant topic and implement their own integration logic as an intelligent endpoint. Alternatively, you can create a flow to filter and transform the events into an output format you can directly store in the receiving data source. Of course, you can also use some native interface, if available. Whether you prefer the streaming integrator or the micro integrator to implement the final step, WSO2 Enterprise Integrator 7 has got you covered.


Obviously, each receiving system would get its own flow, with its own filtering and transformation logic, tailor made to fit its particular needs. This is what makes the architecture easily extensible and maintainable at the enterprise level.
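The per-target flows can be pictured as small filter-and-transform functions, one per receiving system. The sketch below is illustrative only: the two targets ("warehouse" and "billing"), the event fields and the output formats are hypothetical.

```python
from typing import Callable, Optional


def warehouse_flow(event: dict) -> Optional[dict]:
    """The warehouse only needs physical products, keyed by SKU."""
    if event["type"] != "product" or not event["data"].get("physical", False):
        return None  # filtered out
    return {"sku": event["data"]["sku"],
            "description": event["data"]["name"][:20]}


def billing_flow(event: dict) -> Optional[dict]:
    """Billing only needs customers, in its own naming convention."""
    if event["type"] != "customer":
        return None
    return {"debtor": event["data"]["name"].upper(),
            "vat": event["data"].get("vat_number")}


# One flow per receiving system; extending means adding an entry here.
FLOWS: dict[str, Callable[[dict], Optional[dict]]] = {
    "warehouse": warehouse_flow,
    "billing": billing_flow,
}


def dispatch(event: dict) -> dict[str, dict]:
    """Run every flow and collect the records each target should receive."""
    deliveries = {}
    for target, flow in FLOWS.items():
        record = flow(event)
        if record is not None:
            deliveries[target] = record
    return deliveries


kettle = dispatch({"type": "product", "data": {
    "sku": "K-100", "name": "Electric kettle", "physical": True}})
ebook = dispatch({"type": "product", "data": {
    "sku": "E-200", "name": "E-book", "physical": False}})
customer = dispatch({"type": "customer", "data": {"name": "Ada Lovelace"}})
print(kettle, ebook, customer)
```

Since no flow knows about any other, a change in one receiving system's needs touches only its own function.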

Final note

If you have a working master data management system but are looking to move to a more distributed microservice architecture instead, my suggested approach is your best friend. It allows you to gradually migrate from the existing to your target architecture, with a built-in fallback functionality to manage your risks. Whatever your master data challenge, I’m sure we can be of help.

