Centralized vs Decentralized data engineering

#data #engineering #dataengineering #architecture

Don’t let the coin flip decide

As a consultant in the data engineering field, I see my share of lakes and meshes. A question that often comes up is which option fits best: a data mesh or a data lake. It boils down to whether data engineering should be centralized or decentralized. This blog introduces the two flavors, shows their differences and gives some clues as to what could be a good option for your enterprise. If you have any feedback or questions, feel free to let me know in the comments. If your feedback does not fit in that small box or you want to know more, please reach out to me at 'ruud.cools at codecentric dot nl'.

Centralized data engineering

The characteristic of this option is that a single component is responsible for a specific task. The data lake is an example: a single centralized repository for all enterprise data. One goal of the lake is to break down the data silos that inhibit the reuse of data across the departments of a company. Another is to have a single source of truth for information that is essential to the business. A further characteristic of the centralized approach is the dedicated team or department responsible for the lake and all the essential components that surround it, e.g. compute, metadata, and security. Often, this team is also responsible for transforming processes or elements of a process, so-called use cases, so that they become data driven or allow data-supported decision making. During the development of a use case, the required data is ingested into the lake, and the transformed data is possibly stored back into the lake, completing the circle.

If we take a small step back and list the key elements of centralized engineering, we see the following:

A single component is responsible for one or more tasks
A single group is responsible for the lake and essential components
Driven from a business perspective to transform processes
Grows organically as new use cases are included
Often a highly standardized methodology is used

Decentralized data engineering

As in application engineering, there is a trend to split up tasks and responsibilities into small independent parts. By doing so, the learnings and best practices of the well-known microservice architecture can be applied. This means that a task is no longer the responsibility of a single team, technology or, more generally speaking, component. These components do not have to be maintained and managed by a single team or department.

Many services can be created, and use cases are developed by linking the right components. These linked services, referred to as nodes, form a graph of nodes and links: the data mesh. For instance, a data lake could be accompanied by other storage solutions such as relational databases, all serving the same purpose of storing data and allowing reuse by different departments. The lake and the individual databases are the nodes within the mesh. The same goes for all the essential components mentioned earlier, such as compute, metadata, and security. The best-of-breed solution for a specific use case can be selected and made part of the mesh.
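To make the idea of a graph of nodes and links a bit more concrete, here is a minimal sketch in Python. The `Mesh` class and the node names (`data_lake`, `crm_database`, `customer_360`, `sales_forecast`) are hypothetical and only serve as illustration. The sketch also assumes the links contain no cycles.

```python
from dataclasses import dataclass, field


@dataclass
class Mesh:
    """A data mesh modeled as a directed graph of nodes and links."""

    # Maps each node name to the set of nodes that consume its output.
    links: dict[str, set[str]] = field(default_factory=dict)

    def add_node(self, name: str) -> None:
        self.links.setdefault(name, set())

    def add_link(self, producer: str, consumer: str) -> None:
        self.add_node(producer)
        self.add_node(consumer)
        self.links[producer].add(consumer)

    def upstream(self, name: str) -> set[str]:
        """Return every node that `name` depends on, directly or indirectly."""
        direct = {p for p, consumers in self.links.items() if name in consumers}
        result = set(direct)
        for parent in direct:
            result |= self.upstream(parent)  # assumes the graph has no cycles
        return result


# Hypothetical example: a lake and a relational database feed two data products.
mesh = Mesh()
mesh.add_link("data_lake", "sales_forecast")
mesh.add_link("crm_database", "customer_360")
mesh.add_link("customer_360", "sales_forecast")

print(sorted(mesh.upstream("sales_forecast")))
# ['crm_database', 'customer_360', 'data_lake']
```

Walking the links upstream like this shows which nodes a use case depends on, which is the kind of question that comes up as soon as a mesh grows beyond a handful of nodes.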

The data products created in the mesh are the information that helps the business make more informed decisions or improve processes. Good lifecycle management of a data product and the nodes that create it is important, because it makes management of the mesh possible: stale nodes are removed from the mesh, nodes receive the required maintenance, and nodes that have a lot of business value are treated as first-class citizens.
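As a rough illustration of such lifecycle management, the sketch below sorts nodes into three buckets. Everything in it is an assumption made for the example: the `Node` fields, the 90-day staleness threshold, the 1 to 5 business value score and the node names. A real enterprise would choose its own criteria.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Node:
    name: str
    last_updated: date
    business_value: int  # hypothetical score from 1 (low) to 5 (high)


def triage(nodes, today, max_age=timedelta(days=90), high_value=4):
    """Sort nodes into remove / first_class / maintain buckets."""
    plan = {"remove": [], "first_class": [], "maintain": []}
    for node in nodes:
        if today - node.last_updated > max_age:
            # Stale: not updated within the allowed age.
            plan["remove"].append(node.name)
        elif node.business_value >= high_value:
            # High business value: treat as a first-class citizen.
            plan["first_class"].append(node.name)
        else:
            # Everything else receives regular maintenance.
            plan["maintain"].append(node.name)
    return plan


nodes = [
    Node("legacy_export", date(2023, 1, 10), 1),
    Node("customer_360", date(2024, 5, 20), 5),
    Node("weather_feed", date(2024, 5, 1), 2),
]

plan = triage(nodes, today=date(2024, 6, 1))
print(plan)
# {'remove': ['legacy_export'], 'first_class': ['customer_360'], 'maintain': ['weather_feed']}
```

In practice a stale node would probably be reviewed with its owner before removal; the sketch only shows that the triage itself can be a simple, repeatable check.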

The most important benefit of decentralizing data engineering is the scale it can reach. This benefit comes from the microservice architecture: work can be spread across multiple teams working on independent nodes, and best-of-breed solutions can be chosen for specific usage within the mesh. Existing operational applications can be added to the mesh if the application can handle analytical requests, or if the data can be decoupled from the operation.

However, with the good also comes the bad. Decentralized data engineering can lead to a chaos of nodes and a cobweb of links between them. This chaos starts when there are duplicate, stale, unmanaged or undescribed nodes. Management of the nodes can be complicated and expensive. Another disadvantage is that the expertise and knowledge to handle and provide analytical data, as opposed to handling and storing operational data, has to exist in all the teams that contribute nodes to the mesh. Otherwise the quality of a node cannot be guaranteed and it will fail to fulfill expectations. Product owners also need to know how to manage a data product, next to the application that they already manage, which introduces additional load and complexity to an already difficult job. Often these disadvantages can be managed by having a first-rate metadata catalog and data product management. Keep in mind that these do not come cheap and place a big load on the enterprise.
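A metadata catalog makes this kind of chaos visible. The sketch below assumes a very simple, hypothetical catalog format (a list of dictionaries with `name`, `owner`, `description` and `source` fields) and flags unmanaged, undescribed and duplicate nodes. It does not represent the API of any specific catalog product.

```python
from collections import defaultdict

# Hypothetical catalog entries; the field names are assumptions for this sketch.
catalog = [
    {
        "name": "orders_raw",
        "owner": "sales-team",
        "description": "Raw order events",
        "source": "erp.orders",
    },
    {
        "name": "orders_copy",
        "owner": None,
        "description": "",
        "source": "erp.orders",
    },
    {
        "name": "customer_360",
        "owner": "crm-team",
        "description": "Unified customer view",
        "source": "crm.customers",
    },
]


def audit(entries):
    """Group node names by the kind of catalog problem they show."""
    issues = defaultdict(list)
    by_source = defaultdict(list)
    for entry in entries:
        if not entry["owner"]:
            issues["unmanaged"].append(entry["name"])
        if not entry["description"].strip():
            issues["undescribed"].append(entry["name"])
        by_source[entry["source"]].append(entry["name"])
    # Nodes that expose the same source are treated as potential duplicates.
    for names in by_source.values():
        if len(names) > 1:
            issues["duplicate"].extend(names)
    return dict(issues)


report = audit(catalog)
print(report)
```

Here `orders_copy` is reported as unmanaged and undescribed, and both `orders_raw` and `orders_copy` are reported as potential duplicates because they share a source. Treating a shared source as a duplicate is a deliberately crude rule; a real catalog would need a better notion of equivalence.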

In summary, the key principles of the decentralized methodology are:

Responsibility is distributed across multiple teams and components
Ability to scale to the required size
Use best of breed solutions
Having high quality metadata can help in restraining complexity
Lifecycle management of data products and nodes is key to maintaining the mesh

Selection

When deciding on the data engineering strategy, the most critical question is the scale at which the enterprise actually wants to become data driven. Asked of the board, the answer will often be "all and everything", which points to a distributed strategy. But as mentioned earlier, the cost of doing so might outweigh the benefits. Another common scenario is that, next to the vision of becoming data driven, teams are pushed forward by technological possibilities and vaguely defined benefits and solutions. In this scenario, it might be unclear at what scale the transformation needs to happen. If an agile approach is taken, a minimum viable product will be created based on a single use case, or just a few. This product often has the characteristics of a centralized strategy. Onboarding new use cases will only cement this strategy further. A later switch then becomes complex, because the solutions for the disadvantages of a decentralized strategy are often not implemented, and catching up is difficult and expensive.

Luckily, from an outside perspective, it is possible to assess the characteristics of the business and, in collaboration with the stakeholders, give suitable advice on which strategy will bear the most fruit, much better than the good old coin flip. The main points to consider are:

The dependency of the core business on data
The scope and complexity of the operation
The number, profitability and diversity of clear and well-established use cases

Please feel free to reach out if you want to discuss your challenges in data engineering further.
