Hello friends - I hope you all had a great “work” week.
Have you started going to the office again? If so, do comment on how your first day back at the office went.
This week I'm going to introduce you to one of the hottest topics of 2022 in the data world. Yes, we are going to talk about the Lakehouse.
First, let's understand the two concepts it builds on: the data warehouse and the data lake.
What is a Data Warehouse?
A data warehouse is a place to store all your structured data. It is typically where your BI workloads run.
You can easily query the data in the warehouse using SQL.
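To make "query using SQL" concrete, here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for a warehouse (a real warehouse would be a dedicated product), and the `sales` table and `revenue_by_region` helper are made up for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for a warehouse, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("EU", 120.0), ("US", 200.0), ("EU", 80.0)],
)


def revenue_by_region(connection):
    """Run a typical BI-style aggregate over structured rows."""
    rows = connection.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
    return dict(rows)


print(revenue_by_region(conn))  # {'EU': 200.0, 'US': 200.0}
```

The point is only the shape of the workflow: structured rows go in, and SQL aggregates come out.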
What is a Data Lake?
A data lake is storage that can persist any data - structured, semi-structured, or unstructured.
Data lakes started gaining a lot of attention when Hadoop was introduced to the data world.
A data lake supports multiple workloads, such as streaming use cases, AI/ML use cases etc.
What is a Lakehouse?
Lakehouse brings us the best of both worlds.
A lakehouse is implemented on top of a data lake.
It does not have a separate warehouse, but it still supports SQL queries - just as a warehouse does.
Lakehouse = data lake + all the good features of a data warehouse.
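The idea of running SQL straight over files in a lake can be sketched as below. This is a toy under stated assumptions: a temporary local directory stands in for an object-store bucket, CSV stands in for a columnar file format, and a throwaway in-memory SQLite connection plays the query engine. The `write_lake_files` and `query_lake` helpers are hypothetical names, not part of any product.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def write_lake_files(lake_dir):
    """Drop two CSV files into the 'lake', as an ingestion job might."""
    (lake_dir / "orders_1.csv").write_text("order_id,amount\n1,10.5\n2,4.5\n")
    (lake_dir / "orders_2.csv").write_text("order_id,amount\n3,20.0\n")


def query_lake(lake_dir, sql):
    """Scan every CSV file in the lake and run SQL over the rows."""
    # Throwaway compute: nothing is persisted outside the lake files.
    engine = sqlite3.connect(":memory:")
    engine.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
    for path in sorted(lake_dir.glob("*.csv")):
        with path.open(newline="") as handle:
            rows = [
                (int(r["order_id"]), float(r["amount"]))
                for r in csv.DictReader(handle)
            ]
        engine.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    result = engine.execute(sql).fetchall()
    engine.close()
    return result


# Assumption: a temp directory stands in for an object-store bucket.
lake = Path(tempfile.mkdtemp())
write_lake_files(lake)
total = query_lake(lake, "SELECT COUNT(*), SUM(amount) FROM orders")
print(total)  # [(3, 35.0)]
```

The files stay the single source of truth, and the engine is created and discarded per query. That is the storage/compute split in miniature.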
What are the benefits of Lakehouse?
Cost benefits of a data lake, since a lakehouse is built on cheaper object storage like AWS S3.
Supports all data - structured, semi-structured, or unstructured.
Supports all workloads - BI, AI, ML, streaming, ETL, and ad hoc querying.
Separate Storage & Compute. This is probably the key advantage over traditional data warehouses, where storage & compute are bundled together.
You get all the good features of a data warehouse, i.e. great performance while querying data, SQL support, and ACID features (updates/deletes).
Available Products
While Databricks popularized the term lakehouse, there are other players that also provide a platform to implement a lakehouse.
Databricks & Dremio have offerings around Lakehouse.
In case you are looking for an open-source offering - you can explore implementing a Lakehouse using Apache Iceberg tables on object storage like AWS S3 (Storage) + Trino/Presto (Query Engine).
Summary
Lakehouse is a new concept that is gaining a lot of attention in the data world.
Lakehouse = Data Lake + Query Engine
Where
Data Lake can be implemented using Cloud Object Storage like AWS S3
and
Query Engine is the compute provided by query processors like Databricks, Dremio or Presto.
However, there is one main component which I have not introduced in this newsletter: Open Table Formats.
The "Open Table Formats" give a data lake DW-like ACID & Time Travel features. Apache Iceberg is one such open table format & I'll write more about it in my next newsletter.
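To give a feel for how a table format can add atomic commits and time travel on top of plain files, here is a toy sketch. It is not Apache Iceberg's actual layout or API: the `ToyTable` class, its `metadata.json` file and the JSON data files are all hypothetical, and it ignores concurrent writers. It only shows the general idea of immutable data files plus a snapshot log that is swapped in atomically.

```python
import json
import os
import tempfile
from pathlib import Path


class ToyTable:
    """Toy table format: immutable data files plus a snapshot log."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.meta = self.root / "metadata.json"
        if not self.meta.exists():
            self._write_meta({"snapshots": [[]]})  # snapshot 0 is the empty table

    def _read_meta(self):
        return json.loads(self.meta.read_text())

    def _write_meta(self, meta):
        # Write to a temp file, then swap it in atomically, so a reader sees
        # either the old snapshot log or the new one, never a partial write.
        tmp = self.meta.with_name(self.meta.name + ".tmp")
        tmp.write_text(json.dumps(meta))
        os.replace(tmp, self.meta)

    def _commit(self, rows, base_files):
        """Write rows to a new data file and record a new snapshot."""
        meta = self._read_meta()
        name = f"data-{len(meta['snapshots'])}.json"
        (self.root / name).write_text(json.dumps(rows))
        meta["snapshots"].append(base_files + [name])
        self._write_meta(meta)
        return len(meta["snapshots"]) - 1  # id of the new snapshot

    def append(self, rows):
        # New snapshot = all files of the latest snapshot + one new file.
        return self._commit(rows, self._read_meta()["snapshots"][-1])

    def delete_where(self, predicate):
        # Copy-on-write delete: surviving rows go into one new file, and
        # older files stay untouched so older snapshots remain readable.
        kept = [row for row in self.read() if not predicate(row)]
        return self._commit(kept, [])

    def read(self, snapshot_id=-1):
        """Read the latest snapshot, or an older one for time travel."""
        files = self._read_meta()["snapshots"][snapshot_id]
        rows = []
        for name in files:
            rows.extend(json.loads((self.root / name).read_text()))
        return rows


table = ToyTable(tempfile.mkdtemp())
s1 = table.append([{"id": 1}, {"id": 2}])
s2 = table.append([{"id": 3}])
s3 = table.delete_where(lambda row: row["id"] == 2)
print(table.read())    # [{'id': 1}, {'id': 3}]
print(table.read(s2))  # [{'id': 1}, {'id': 2}, {'id': 3}]
```

Because data files are never modified in place, reading an older snapshot id is all it takes to "travel" back to the table as it was before the delete.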
Till then you can explore more about this topic.
Here are a few links to read.
I hope you liked this week's topic. Please comment/mail/DM & let me know your feedback.