Hello dear friends - I hope you all had a great finish to 2022!
Introduction
In this post, I’ll write about some data trends that I think will be top priorities for data enterprises. It is based on the multiple webinars, talks, and summits I have attended over the last 2-3 months.
Let’s get started
Lakehouse Architecture
This is one of the most talked-about initiatives at most summits and webinars.
Many teams now want to build a single lakehouse instead of maintaining a separate data warehouse and data lake.
All the leading data platforms now have products/features for implementing a lakehouse.
Databricks is widely seen as the market leader and seems to have the most mature solution, having been founded by the original creators of Apache Spark and having created Delta Lake.
Apache Iceberg is getting adopted by AWS services like Athena, EMR, and Glue.
Snowflake now supports implementing a lakehouse using Iceberg tables.
There seems to be a clear shift towards building lakehouses instead of enterprise warehouses. If you have not yet explored lakehouses, now is the right time to read up on how they work and what their advantages are.
Data Mesh
I’ve been hearing about Data Mesh throughout 2022. Every modern data enterprise seems to be discussing and planning to try it. But it’s not that easy.
Data Mesh is not just an architectural change - it’s an org-level initiative that needs a change in mindset about how data should be owned and managed, and by whom.
Data Mesh is based on four main pillars:
Domain Ownership - domain teams are responsible for their data.
Data as a Product - domain teams should treat their data as a product and make it available for other domains or downstream consumers.
Self-Serve Data Infrastructure - a dedicated team manages the data platform and enables domain teams to leverage this platform for their use cases.
Federated Governance - standardization of data products across domains to make them easier to manage and share, and to help them adhere to industry and regulatory standards.
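To make the four pillars a little more concrete, here is a minimal Python sketch of what a data product descriptor and a shared governance check could look like. Everything in it (the DataProduct class, its fields, and the check_governance rules) is hypothetical and made up for illustration; it is not taken from any Data Mesh tool or standard.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A hypothetical descriptor a domain team publishes for its data."""
    name: str
    owner_domain: str            # Domain Ownership: who is responsible
    description: str = ""        # Data as a Product: documented for consumers
    schema: dict = field(default_factory=dict)   # column name -> type
    contains_pii: bool = False
    pii_columns: list = field(default_factory=list)


def check_governance(product: DataProduct) -> list:
    """Federated Governance: the same (assumed) rules apply to every domain.

    Returns a list of violations; an empty list means the product passes.
    """
    violations = []
    if not product.owner_domain:
        violations.append("missing owner domain")
    if not product.description:
        violations.append("missing description")
    if not product.schema:
        violations.append("missing schema")
    if product.contains_pii and not product.pii_columns:
        violations.append("PII flagged but no PII columns listed")
    return violations


# Example product owned by a hypothetical "sales" domain.
orders = DataProduct(
    name="orders_daily",
    owner_domain="sales",
    description="One row per order, refreshed daily.",
    schema={"order_id": "string", "amount": "decimal", "email": "string"},
    contains_pii=True,
)
print(check_governance(orders))
```

Here the sales domain owns and describes its own product, while the check itself would live on the shared self-serve platform, so every domain is held to the same standard. In this example the check reports that PII is flagged but no PII columns are listed.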
Read more about Data Mesh to understand what it means in practice. If you are new to the data world, it is worth learning which problems it tries to solve for modern data organizations.
Data Governance
Data Governance is a broad topic to discuss and understand. It consists of several initiatives for managing your data in a better way.
Some of the initiatives that fall under Data Governance are:
Data Quality - Validations and Improvement
Metadata Management and Data Discovery
Data Audit and Data Lineage
Access Control and Secure Data Sharing
Managing Master Data
Regular Review Process
None of these are new; they have been implemented for many years for managing data in warehouses. However, most of them are challenging when managing data in data lakes or lakehouses.
Not many organizations have implemented these successfully for unstructured data stored in cloud object storage. You might see new initiatives in your organizations to implement these using the modern data stack.
Multiple products are available in the market to implement each of these, which makes it harder to find the right product for your specific use case. And the new architecture patterns and use cases like Lakehouse, Data Mesh, Data Products, and Data Marketplace will make Data Governance more critical and challenging.
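As a small illustration of the Data Quality item in the list above, here is a sketch of a validation step in plain Python. The rules, the column names, and the validate_rows helper are assumptions made up for this example; a real project would more likely use a dedicated data quality tool.

```python
def validate_rows(rows, required, non_negative=()):
    """Split rows into valid and rejected, recording why each was rejected."""
    valid, rejected = [], []
    for row in rows:
        # Rule 1 (assumed): required columns must be present and non-empty.
        problems = [
            f"missing {col}" for col in required if row.get(col) in (None, "")
        ]
        # Rule 2 (assumed): selected numeric columns must not be negative.
        problems += [
            f"negative {col}"
            for col in non_negative
            if isinstance(row.get(col), (int, float)) and row[col] < 0
        ]
        if problems:
            rejected.append((row, problems))
        else:
            valid.append(row)
    return valid, rejected


# Hypothetical input rows.
rows = [
    {"customer_id": "C1", "amount": 120.0},
    {"customer_id": "", "amount": 35.5},
    {"customer_id": "C3", "amount": -10.0},
]
valid, rejected = validate_rows(
    rows, required=["customer_id", "amount"], non_negative=["amount"]
)
print(len(valid), "valid;", len(rejected), "rejected")
for row, problems in rejected:
    print(row, "->", problems)
```

Keeping the rejected rows together with the reason for rejection, rather than silently dropping them, also helps with the audit and review items in the same list.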
Real-time Processing/Streaming
Traditional data warehouses were populated at EoD (End of Day) or SoD (Start of Day) as a batch process. BI users were happy to see their data (correct & complete) once a day. But times have changed, and more and more decisions are now made in real time.
You now want instant alerts for any credit card fraud or unauthorized access. Even real-time movie recommendations or flash sale alerts are required for quick decision-making.
As the world moves towards more real-time use cases, there will be a lot of demand for implementing architectures that can support such streaming analytics. 2023 might see many enterprises embarking on this journey of supporting streaming, near real-time, or micro-batch use cases.
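The difference between a once-a-day batch and a micro-batch approach can be sketched in a few lines of Python. This is only a toy illustration: the transactions, the micro_batches helper, and the "amount above a limit" fraud rule are all assumptions for the example, and a real system would read from a streaming platform rather than a list.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Group an event stream into small batches instead of one end-of-day load."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def suspicious(batch, limit):
    """Toy fraud rule (an assumption): flag any transaction above `limit`."""
    return [event for event in batch if event["amount"] > limit]


# Hypothetical card transactions arriving over time.
transactions = [
    {"card": "A", "amount": 25},
    {"card": "B", "amount": 9000},
    {"card": "A", "amount": 40},
    {"card": "C", "amount": 12},
    {"card": "B", "amount": 7500},
]

alerts = []
for batch in micro_batches(transactions, batch_size=2):
    # Each small batch is checked as soon as it is complete,
    # so an alert does not have to wait for the end of the day.
    alerts.extend(suspicious(batch, limit=5000))

print(alerts)
```

With a batch size of 2, the first alert is raised after the second transaction instead of after all five. Shrinking the batch size moves the behavior closer to true streaming, at the cost of more frequent processing.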
Data Architecture & Data Modeling
And finally, this is my favorite one - more focus on Data Architecture and Data Modeling.
These are building blocks for implementing a data platform. Getting the right architecture blueprints and suitable modeling strategy for storing your data can help in the long run.
With the rise of Hadoop, data modeling took a bit of a back seat. Data, in any shape and form, was dumped into the lake without any modeling guidance. This soon resulted in data swamps, making it extremely difficult to discover and use the data.
Since last year, I’ve heard many industry experts talking about the need for the right architectures and modeling. It seems like Data Modelers are back in demand, and enterprises now want to store their data in lakes, lakehouses, or warehouses using the most suitable modeling approach, such as the Dimensional Model or Data Vault.
It is certainly an important aspect when building a data platform. Keep an eye on various talks around Data Architecture and Data Modeling.
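For readers who have not seen a Dimensional Model before, here is a tiny star schema sketched with Python's built-in sqlite3 module. The table names, columns, and sample rows are made up for illustration; the point is only the shape, with one fact table of measurements joined to a dimension table of descriptive attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One dimension table (descriptive attributes) and one fact table (measures).
conn.executescript("""
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        category TEXT
    );
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_key INTEGER REFERENCES dim_product(product_key),
        sale_date TEXT,
        amount REAL
    );
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    (1, "Laptop", "Electronics"),
    (2, "Desk", "Furniture"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (1, 1, "2023-01-02", 1200.0),
    (2, 1, "2023-01-03", 1150.0),
    (3, 2, "2023-01-03", 300.0),
])

# A typical BI query: aggregate the facts, sliced by a dimension attribute.
totals = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(totals)
```

Because the descriptive attributes live in the dimension table, the data stays discoverable: a new question such as "sales by product name" only needs a different GROUP BY, not a new dump of raw files.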
Summary
Five data trends that you should watch out for in 2023:
Lakehouse Architectures
Data Mesh
Data Governance
Streaming/Real-Time processing
Data Architecture & Data Modeling
I’ve started a Substack chat thread for this topic. If anyone wants to discuss this further, we can connect on Substack chat. I’ll also post links to various webinars, talks, and events you can attend to learn more about these trending “data” topics. Read this post to see how to install the Substack chat app.
Thanks for reading, and best wishes for a fantastic 2023!