Vol #11 | What are open table formats
Have you recently heard about Iceberg, Hudi or Delta Lake?
Very good morning to all my readers. Writing this on a rainy Saturday morning in Pune, India.
Today, I’m going to introduce you to the world of open table formats. You might have already worked on these but might not be aware that these are called table formats in the lakehouse world!
What are table formats?
A table format is a layer on top of file formats like Parquet, Avro, and ORC. It brings database-like features to data lakes.
A table format groups multiple data files into a single logical table so that you can easily access & analyse the data.
Hive table format
The most common & widely used table format is Hive.
If you have worked on Hive, you will know that it provides a structure on top of files stored in HDFS, so you can use SQL to query that data. (Hive also provides a query engine & a metastore. In this context, think of Hive only as a table format on top of HDFS.)
Limitations of Hive table format
You cannot update individual rows in a regular Hive table. Updates are only possible with transactional (ACID) tables, a feature available in Hortonworks distributions that was not widely used.
The lowest grain is the partition. So if you want to update data in a table, you have to drop the complete partition & re-create it.
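To make the second limitation concrete, here is a small toy sketch in plain Python. It is not Hive code: the `orders` table, the `dt=...` partition folders, the JSON-lines data files and the helper names are all made up for illustration. It only mimics the idea that a partition is a folder of files, and that changing one row means rewriting the whole folder.

```python
import json
import shutil
import tempfile
from pathlib import Path


def write_partition(table_dir: Path, partition: str, rows: list) -> None:
    """Replace a whole partition folder with a freshly written data file."""
    part_dir = table_dir / partition
    if part_dir.exists():
        shutil.rmtree(part_dir)  # drop the complete partition
    part_dir.mkdir(parents=True)
    with open(part_dir / "part-00000.json", "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_partition(table_dir: Path, partition: str) -> list:
    """Read every row from every data file in one partition folder."""
    rows = []
    for data_file in sorted((table_dir / partition).glob("*.json")):
        with open(data_file) as f:
            rows.extend(json.loads(line) for line in f if line.strip())
    return rows


def update_row(table_dir: Path, partition: str, order_id: int, new_amount: int) -> int:
    """Change one row by rewriting its partition; returns rows rewritten."""
    rows = read_partition(table_dir, partition)
    for row in rows:
        if row["order_id"] == order_id:
            row["amount"] = new_amount
    write_partition(table_dir, partition, rows)  # re-create the partition
    return len(rows)


# Hypothetical table with two daily partitions (illustrative data only).
table_dir = Path(tempfile.mkdtemp()) / "orders"
write_partition(table_dir, "dt=2024-01-01", [
    {"order_id": 1, "amount": 100},
    {"order_id": 2, "amount": 250},
    {"order_id": 3, "amount": 75},
])
write_partition(table_dir, "dt=2024-01-02", [{"order_id": 4, "amount": 40}])

rewritten = update_row(table_dir, "dt=2024-01-01", order_id=2, new_amount=300)
print(f"Rewrote {rewritten} rows to change 1")
```

Only one order changed, yet every row in that day's partition was written again. With real partitions holding millions of rows, that cost adds up quickly.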
Many enterprises are now adopting new table formats to build their data lakes and lakehouses. These formats address the issues above and support advanced features like time travel.
These table formats are the backbone of lakehouse architectures, which provide data-warehouse capabilities on top of a data lake.
The 3 formats - Iceberg, Hudi, Delta
The 3 most popular open table formats currently being discussed & explored are:
Apache Hudi - Developed at Uber, Hudi is an Apache Top-Level Project and was one of the first of these 3 that cloud providers supported. I remember exploring Hudi for building an AWS EMR-based ecosystem in 2020.
Apache Iceberg - Iceberg is currently getting a lot of attention from the data community, and many popular products (Snowflake) & newer ones (Dremio) have started supporting it. Iceberg supports a variety of file formats like Parquet, ORC and Avro.
Delta Lake - Delta Lake was created by Databricks & open-sourced under the Linux Foundation. At the recent Data + AI Summit, Databricks announced that it is open-sourcing all the features of Delta Lake.
All these table formats provide features like ACID support, time travel, and performance optimisation to help you build lakehouse architectures. However, they differ in their approach to maintaining & managing the changes in data.
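If you are wondering how a layer of metadata can give you time travel, here is a toy sketch in plain Python. It is not how Iceberg, Hudi or Delta Lake is actually implemented, and the `ToyTable` and `Snapshot` names are made up for this post. It only shows the general idea: data files are never changed in place, every commit records a new list of files, and reading an older version just means reading an older list.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # names of the immutable data files in this version


@dataclass
class ToyTable:
    files: dict = field(default_factory=dict)      # file name -> list of rows
    snapshots: list = field(default_factory=list)  # append-only history

    def _commit(self, data_files) -> int:
        # A commit is one append to the history, so readers see either
        # the old version or the new one, never a half-written table.
        snap = Snapshot(len(self.snapshots) + 1, tuple(data_files))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def current_files(self) -> list:
        return list(self.snapshots[-1].data_files) if self.snapshots else []

    def append(self, rows) -> int:
        name = f"data-{len(self.files) + 1}.parquet"  # illustrative name only
        self.files[name] = list(rows)
        return self._commit(self.current_files() + [name])

    def update(self, key, column, value) -> int:
        """Copy-on-write: rewrite only the files that contain the key."""
        new_files = []
        for name in self.current_files():
            rows = self.files[name]
            if any(r["id"] == key for r in rows):
                new_name = f"data-{len(self.files) + 1}.parquet"
                self.files[new_name] = [
                    {**r, column: value} if r["id"] == key else r for r in rows
                ]
                new_files.append(new_name)
            else:
                new_files.append(name)  # untouched file is reused as-is
        return self._commit(new_files)

    def scan(self, snapshot_id=None) -> list:
        """Read the latest version, or an older one for time travel."""
        if not self.snapshots:
            return []
        snap = (self.snapshots[-1] if snapshot_id is None
                else self.snapshots[snapshot_id - 1])
        return [row for name in snap.data_files for row in self.files[name]]


table = ToyTable()
v1 = table.append([{"id": 1, "city": "Pune"}, {"id": 2, "city": "Mumbai"}])
v2 = table.append([{"id": 3, "city": "Delhi"}])
v3 = table.update(key=1, column="city", value="Bengaluru")

latest = table.scan()                 # sees the updated row
as_of_v2 = table.scan(snapshot_id=v2)  # still sees the old value
print(latest[0]["city"], as_of_v2[0]["city"])
```

Notice that the update rewrote only the one file containing `id` 1 and reused the other file untouched, which is a finer grain than the partition-level rewrite that Hive needs. The real formats each handle this bookkeeping in their own way, and that is where most of their differences lie.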
The intent of this post is to introduce you to the world of table formats, so I have not gone into deeper details about each format. For those who want to explore further, I came across some excellent blogs during my research. You can refer to them for a better understanding of these topics.
Some of these are listed below for your reference:
Getting started - For beginners to understand more about these formats
Comparison of the table formats - Deep Dive to understand the differences
The adoption rate of these table formats - In case you are evaluating these & want to finalise one of these in your projects
Hope this helps you to get started with your open table format journey.
On another note, if you are exploring Databricks and planning to get certified, do check out this tweet for certification exam training and a free voucher!
Thanks for reading! Have a great weekend.