Imagine you’ve been collecting data for your business. So far so good. At first, it seemed like a fairly straightforward task. You build a database server and start populating it with data. You can then integrate that data into the apps and services you build and manage, and it seems like you’re an unstoppable force.
But then things begin to grow, and grow, and grow, until eventually that standard in-house data storage is no longer big, fast, and/or powerful enough. You can’t keep scaling your business because the database server you’re using isn’t capable of handling the extreme demand placed on your apps and services by an ever-growing user base.
What Do You Do?
Your first thought might be to switch to a NoSQL database, which can be better suited for handling larger stores of data. But even a NoSQL database will be limited within a typical database deployment.
You need to think bigger. Much bigger.
When your data demands reach that kind of scope, it’s time to consider either a data mesh or a data lake. But what are these technologies, and which one is the best fit for your company? Let’s dive in and see what’s what.
What Is Data Mesh?
To understand what data mesh is, you first must have a basic knowledge of what decentralized computing is, as this is key to the technology.
Imagine, if you will, you have a service that you run on a machine in a data center. That service can only be found and accessed on that one machine. Not only does this mean you have absolute control over the service, but it also means the machine housing the service must be able to meet the demands placed on it.
This scenario is very common; in fact, it’s the most popular method in use today. It’s called centralized computing. The problem with centralized computing is that it’s not only isolated but also fragile, because one machine or one network becomes a single point of failure. Even if you have failover set up for that server, what happens if the network the server is attached to goes down? At that point, no one would be able to access your service.
With that in mind, you should now see why centralized data isn’t always the best idea.
That’s where decentralized computing comes into play. It works by spreading the load out over multiple machines across the internet. So instead of one server (or a cluster of servers) within your data center housing that data, it’s spread across possibly hundreds of machines outside of your data center. That means, should one machine or network go down, the data is still accessible.
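To make the failover idea concrete, here is a minimal sketch in Python. The replica functions and the `ReplicaUnavailable` exception are hypothetical stand-ins for machines on different networks, not part of any real product. The reader simply tries each copy in turn until one answers.

```python
from typing import Callable, Optional, Sequence


class ReplicaUnavailable(Exception):
    """Raised by a (hypothetical) replica that cannot be reached."""


def read_with_failover(key: str, replicas: Sequence[Callable[[str], str]]) -> str:
    """Try each replica in order and return the first successful answer."""
    last_error: Optional[Exception] = None
    for fetch in replicas:
        try:
            return fetch(key)
        except ReplicaUnavailable as err:
            last_error = err  # remember the failure and try the next machine
    raise ReplicaUnavailable(f"no replica could serve {key!r}") from last_error


# Two in-memory stand-ins for machines on different networks (assumed for illustration).
def offline_replica(key: str) -> str:
    raise ReplicaUnavailable("network down")


def healthy_replica(key: str) -> str:
    return {"customer:42": "Ada"}[key]


print(read_with_failover("customer:42", [offline_replica, healthy_replica]))
```

With a single centralized server, the first failure would be the end of the story. Here the request only fails if every copy is unreachable.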
Data mesh follows this same idea. It’s a decentralized approach to storing data in which no single central team claims ownership of all the data. Now, before you think of this as a security problem, it’s not. Although the idea behind decentralized computing does shift ownership a bit wider than you might typically care for, data mesh doesn’t follow that principle to the letter. Your company still owns the data, and, should you choose, the data is accessible only to you and the apps and services that depend on it.
Instead, the data is spread across data lakes and/or data warehouses such that it can be accessed 24/7 by applications and services. This also has the added benefit of being able to scale to massive proportions.
But to further understand the concept behind data mesh, there are four ideas you must understand:
- Domain-oriented ownership – This isn’t about domain names but, rather, about groups within the organization that are formed around a specific task or need. You might have domains like development, management, and administration. Each of those domains owns the data it produces and will probably require access to data from the others.
- Data as product – The best way to understand data as product is by way of Zhamak Dehghani’s article, wherein she states, “Domain data teams must apply product thinking with similar rigor to the datasets that they provide; considering their data assets as their products and the rest of the organization’s data scientists, ML and data engineers as their customers.”
- Self-serve data platform – This concept is all about the autonomy of each domain and the ability to access the data contained within the data mesh.
- Federated computational governance – This embeds all data governance into the workflows of each domain.
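The four ideas above can be sketched in a few lines of Python. Every name here (`DataProduct`, `MeshCatalog`, `no_raw_emails`) is hypothetical and chosen only for illustration. A real mesh would sit on top of actual storage and access controls, so treat this as a toy model rather than an implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class DataProduct:
    """A dataset published by one domain team for the rest of the company."""
    name: str
    owner_domain: str   # domain-oriented ownership
    description: str    # data as a product: documented for its consumers
    rows: List[Row] = field(default_factory=list)


class MeshCatalog:
    """Self-serve platform stand-in: domains publish, consumers look up and read."""

    def __init__(self, policies: List[Callable[[DataProduct], None]]):
        self._policies = policies  # federated governance: shared rules, run automatically
        self._products: Dict[str, DataProduct] = {}

    def publish(self, product: DataProduct) -> None:
        for check in self._policies:
            check(product)  # raises ValueError if a rule is violated
        self._products[product.name] = product

    def read(self, name: str) -> List[Row]:
        return list(self._products[name].rows)


def no_raw_emails(product: DataProduct) -> None:
    """Example company-wide policy: no product may expose an 'email' column."""
    if any("email" in row for row in product.rows):
        raise ValueError(f"{product.name}: 'email' must be masked before publishing")


catalog = MeshCatalog(policies=[no_raw_emails])
catalog.publish(
    DataProduct("orders.daily", "sales", "One row per order", [{"order_id": 1, "total": 9.5}])
)
print(catalog.read("orders.daily"))
```

The point of the sketch is the division of labor: the sales domain owns and documents its product, any consumer can read it without asking a central team, and the governance rule runs as code on every publish.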
Another difference between typical data storage and data mesh is that we’re transitioning away from traditional monolithic data infrastructure into more of a microservices-type infrastructure. In this new form, each domain handles its own data and data pipeline, and one more component is needed as the connective tissue: a universal interoperability layer that links the domains.
By allowing each domain to handle its own data, things can run more efficiently and effectively and deliver information at almost real-time speeds.
In order to fully grasp what data mesh is, you must also rethink how you examine data. With data mesh, data becomes a product. This is achieved by creating a self-service infrastructure so that data is more accessible to stakeholders. Because of that, data mesh can provide a more resilient approach to dealing with data such that a business can more easily respond to changes.
Current data management approaches are typically based on complex, heavily integrated ETL between operational and analytical systems, and they struggle to change quickly enough to support business needs. The purpose of data mesh is to provide a more resilient approach to data so that a business can respond to such changes efficiently.
Benefits of Data Mesh
Data mesh can provide the following benefits:
- Better clarity into the value of data
- Better data availability, because of the decentralized nature of the mesh
- Faster innovation thanks to the shift from manual, batch-oriented extract, transform, load (ETL) to continuous transformation and loading (CTL)
- Greatly reduced data engineering times with the help of CI/CD, no-code, self-service data pipeline tools, and agile development
- Data democratization, thanks to self-service applications from multiple data sources
- Cost-effective data storage and processing
- Far less technical debt
- Better interoperability
- Improved security and compliance
It’s also important to understand the components that comprise a data mesh, which are:
- Data sources
- Data infrastructure
- Domain-oriented data pipelines
Although most data storage uses the first two components, only a data mesh functions with domain-oriented data pipelines. Such pipelines also lend the data mesh a level of observability and governance that other structures do not. And given how so many businesses are making the shift toward agile, observability and governance are key to making that transition.
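As a rough illustration of a domain-oriented pipeline with built-in observability, consider the following Python sketch. The `DomainPipeline` class, its counters, and the sample sales records are all assumptions made for the example. The idea is only that each domain validates and transforms its own data and keeps its own metrics.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


@dataclass
class DomainPipeline:
    """A pipeline owned by one domain that counts what it sees, publishes, and rejects."""
    domain: str
    transform: Callable[[Row], Row]
    is_valid: Callable[[Row], bool]
    metrics: Dict[str, int] = field(
        default_factory=lambda: {"seen": 0, "published": 0, "rejected": 0}
    )

    def run(self, source: Iterable[Row]) -> List[Row]:
        published: List[Row] = []
        for row in source:
            self.metrics["seen"] += 1
            if not self.is_valid(row):
                self.metrics["rejected"] += 1  # governance: bad rows never leave the domain
                continue
            published.append(self.transform(row))
            self.metrics["published"] += 1
        return published


sales = DomainPipeline(
    domain="sales",
    transform=lambda r: {"order_id": r["id"], "total_cents": round(float(r["total"]) * 100)},
    is_valid=lambda r: "id" in r and "total" in r,
)
out = sales.run([{"id": 1, "total": "9.50"}, {"total": "3.00"}])
print(out, sales.metrics)
```

Because the counters live with the pipeline, the sales team can see at a glance how much of its data was rejected, without depending on a central data team to tell them.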
What Is Data Mesh Used For?
There are four very important use cases that are perfectly suited for data mesh:
- Business intelligence dashboards, which can greatly help an organization make sense of the data and the underlying performance.
- Virtual assistants, such as chatbots, which are a great example of how a data mesh can be used.
- Customer experience relies heavily on massive stores of decentralized data to help improve how companies deal with customers, clients, and other businesses.
- Artificial intelligence (AI) and machine learning (ML) can be greatly enhanced by using domain-agnostic data. By using data mesh, the performance of both AI and ML can be dramatically increased.
What Is the Difference Between Data Lake and Data Mesh?
To understand the difference, we must first define a data lake, which is a data repository that stores data in its most raw form, without a unified schema. Consider a data lake as a vast storage pool that houses data that can be used immediately or in the future. This data is typically stored as either object blobs or files and has very little (if any) organization.
Data lakes can store both structured and unstructured data and can do so at any scale. So whether you have a small collection or a massive trove of data, the data lake is there to serve.
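Here is a small Python sketch of what “raw form, without a unified schema” looks like in practice. A temporary local directory stands in for the lake (an assumption for the example; real lakes usually sit on object storage), and the `land_raw` helper and file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Store bytes exactly as received, grouped only by where they came from."""
    target = lake_root / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


# A temporary directory plays the role of the lake.
lake = Path(tempfile.mkdtemp(prefix="lake_"))

# Different formats land side by side; nothing is parsed or validated on the way in.
land_raw(lake, "webshop", "order-1.json", json.dumps({"id": 1, "total": "9.50"}).encode())
land_raw(lake, "sensors", "reading-1.csv", b"device,temp\nA7,21.4\n")

stored = sorted(p.relative_to(lake).as_posix() for p in lake.rglob("*") if p.is_file())
print(stored)
```

Notice that the JSON order and the CSV sensor reading share nothing except a storage location. Making sense of them is left to whoever reads the data later.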
However, a data lake is a centralized repository. Remember, one of the key traits of data mesh is that it’s decentralized, which makes these two ideas vastly different. Where a data lake typically stores data either on a single machine or a cluster of machines on the same network, a data mesh stores data on machines and networks spread out across the internet.
Advantages of a Data Lake
The advantages of a data lake include:
- Speed – both for creating and analyzing data
- Low cost – Consumer-grade hardware and open-source technologies can be used.
- Reduced waste – Data lakes conserve resources because much of your data will remain idle until it’s used.
Characteristics of a Data Lake
- Can use relational and non-relational databases from many sources (such as IoT, websites, mobile apps, social media, and traditional applications)
- Schema on read, which means the structure of the data is defined when it’s read for analysis, not when it’s written
- Faster query results
- Raw data support
- Perfectly suited for data science, data development, and business analysis
- Support for machine learning, predictive analysis, data discovery, and profiling
- Supports the comprehensive analysis of both big and small data from a single location
- Very low latency
- Data analysis can be done at any time.
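Schema on read is easier to grasp with a small example. In this Python sketch, the raw lines, the `SCHEMA` mapping, and the `read_with_schema` helper are all made up for illustration. The stored records are inconsistent and untyped, and structure is imposed only when an analysis reads them.

```python
import json
from typing import Dict, Iterator, List

# Raw events as they were landed: fields are inconsistent and untyped.
RAW_LINES = [
    '{"user": "ada", "amount": "12.50", "note": "gift"}',
    '{"user": "bob", "amount": 3}',
    '{"user": "cy"}',
]

# The schema lives with the reader, not with the stored data.
SCHEMA = {"user": str, "amount": float}


def read_with_schema(lines: List[str], schema: Dict[str, type]) -> Iterator[dict]:
    """Parse raw lines and coerce only the fields this analysis needs."""
    for line in lines:
        record = json.loads(line)
        try:
            yield {name: cast(record[name]) for name, cast in schema.items()}
        except (KeyError, ValueError, TypeError):
            continue  # skip records that do not fit this particular schema


rows = list(read_with_schema(RAW_LINES, SCHEMA))
print(rows)
```

A different analysis could read the very same lines with a different schema, for example one that keeps the `note` field, without anything in storage having to change.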
Another big difference between data mesh and data lakes is that data mesh is an ideal setup for the distribution of data to different departments, branches, and locations of a company. With a data lake, you don’t have that kind of flexibility or control over the different data pipelines.
How to Select the Best Option for Your Business
One thing to keep in mind is that this is not strictly an either-or proposition. For example, if your company already makes use of a data lake, you can add a data mesh to integrate decentralized, domain-specific data pipelines into the mix.
However, for those businesses that are just starting to dip their toes into these waters, consider this: data mesh is best suited for businesses with minimal (or no) infrastructure and that don’t have the time to bother with setting it up. Because of the way data mesh works, a business doesn’t have to rely on its own infrastructure, so it can spend more time and budget on developing the applications and services that will make use of the data.
On the other hand, a data lake is the better option for big data, where massive troves of data must be stored, prepared, and analyzed over a period of time.
Data Mesh or Data Lake?
To sum it all up, consider these points:
- Massive troves of raw, unstructured data that must be stored and processed at a later time are best served by a data lake. If your company needs to archive huge amounts of raw data, the data lake is the best option.
- If cost is an issue, a data lake is the obvious option.
- If you need real-time data analysis, insights, and reporting, a data mesh is the ideal route.
- If your business model requires speedy gathering of data from disconnected systems for immediate processing and analysis, a data mesh is the best choice.
- If data at scale is your key priority, either a data lake or data mesh will do just fine.
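If it helps, the checklist above can be written down as a tiny Python helper. The `Needs` fields and the equal weighting of each point are assumptions made for this sketch, not a formal method, so use it as a conversation starter rather than a verdict.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    archive_raw_data: bool = False      # huge volumes kept for later processing
    cost_sensitive: bool = False
    real_time_analytics: bool = False
    disconnected_sources: bool = False  # data gathered from many separate systems


def recommend(needs: Needs) -> str:
    """Count the points that favor each option; a tie means either one works."""
    lake_points = int(needs.archive_raw_data) + int(needs.cost_sensitive)
    mesh_points = int(needs.real_time_analytics) + int(needs.disconnected_sources)
    if lake_points > mesh_points:
        return "data lake"
    if mesh_points > lake_points:
        return "data mesh"
    return "either (or both together)"


print(recommend(Needs(archive_raw_data=True, cost_sensitive=True)))
print(recommend(Needs(real_time_analytics=True)))
```

Scale is deliberately left out of the scoring because, as noted in the last point, both options handle it.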
As far as the disadvantages of each, consider these points:
- A data lake can be much slower at delivering insights from raw data.
- A data lake doesn’t scale as well as a data mesh for real-time analytics on data.
- Data mesh is an architecture rather than a turnkey product, so you have to build it yourself on your on-premises servers or a third-party host.
To easily sum up the differences, a data lake is the best option for storing huge amounts of data in a centralized location, whereas a data mesh is best suited for immediate data retrieval, domain-driven data pipelines, and more widespread integration.