What is Data Catalog?
A Data Catalog is where you store the metadata of your data platform.
It holds metadata about the tables present in your data ecosystem.
It can also hold additional information, such as:
Tags related to tables or attributes. E.g. Production or Dev tables; attributes classified as sensitive etc.
Business meaning of attributes. E.g. Cust_FN means Customer First Name, Cust_AD means Customer Address
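To make this concrete, here is a minimal sketch of how one catalog entry could be modelled in Python. The class and field names (CatalogEntry, Attribute, business_name), the S3 path and the Cust_ID column are assumptions made up for this illustration; they do not come from any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str                 # physical column name, e.g. "Cust_FN"
    data_type: str            # e.g. "string"
    business_name: str = ""   # business meaning of the attribute
    tags: set = field(default_factory=set)  # e.g. {"sensitive"}


@dataclass
class CatalogEntry:
    table: str
    location: str             # where the underlying data lives
    tags: set = field(default_factory=set)  # e.g. {"Production"}
    attributes: list = field(default_factory=list)

    def sensitive_attributes(self):
        """Return the names of attributes tagged as sensitive."""
        return [a.name for a in self.attributes if "sensitive" in a.tags]


# Hypothetical entry; the path and the Cust_ID column are illustrative only.
customer = CatalogEntry(
    table="customer",
    location="s3://example-bucket/customer/",
    tags={"Production"},
    attributes=[
        Attribute("Cust_FN", "string", "Customer First Name", {"sensitive"}),
        Attribute("Cust_AD", "string", "Customer Address", {"sensitive"}),
        Attribute("Cust_ID", "bigint", "Customer Identifier"),
    ],
)

print(customer.sensitive_attributes())
```

A real catalog stores far more than this, but the idea is the same: technical metadata, tags and business meaning sit side by side for every table.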
Why do we need a Data Catalog?
A catalog is essential for discovering the data present in the system.
It helps users understand what data is available & query the attributes they are interested in.
Without metadata such as table definitions, SQL-based tools like Athena or Presto cannot query your data.
How to implement a Data Catalog?
Files present in a data lake like S3 can have their metadata created in the AWS Glue Data Catalog.
You can use AWS Glue crawlers to crawl the S3 files, extract their metadata & store it in the Catalog.
Once the metadata is stored in the Catalog, the data can be easily accessed & queried by query services like Amazon Athena.
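The flow above could look roughly like this with boto3, the AWS SDK for Python. Treat it as a sketch under assumptions: the crawler name, IAM role ARN, database, bucket paths and the "sales" table are placeholders I invented, and the AWS calls are wrapped in a function that is not invoked here because it needs real credentials and permissions.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build the arguments for Glue's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def athena_request(sql, database, output_location):
    """Build the arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def crawl_and_query(crawler, query):
    """Create and start the crawler, then submit the query to Athena."""
    import boto3  # imported lazily; requires AWS credentials to run

    glue = boto3.client("glue")
    glue.create_crawler(**crawler)
    glue.start_crawler(Name=crawler["Name"])
    # A real script would wait here (for example by polling
    # glue.get_crawler) until the crawler has finished, so that the
    # table exists in the Catalog before it is queried.
    athena = boto3.client("athena")
    response = athena.start_query_execution(**query)
    return response["QueryExecutionId"]


# Placeholder values, for illustration only.
crawler = crawler_request(
    "sales-crawler",
    "arn:aws:iam::123456789012:role/ExampleGlueRole",
    "sales_db",
    "s3://example-bucket/sales/",
)
query = athena_request(
    "SELECT * FROM sales LIMIT 10",
    "sales_db",
    "s3://example-bucket/athena-results/",
)
```

Calling crawl_and_query(crawler, query) in an AWS account with the right permissions would register the table in the Glue Data Catalog and then submit the Athena query against it.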
What has changed in the Data Catalog space over the last few years?
A lot of new features are being added to the Catalog. It is no longer just a system to store metadata.
Data Catalog is now being used:
to understand lineage (how data flows from source to target)
to classify sensitive & PII data
to understand if there are any duplicates anywhere in the system
to find the various environments (dev/test/prod) where a particular table exists
to understand how many tables are impacted if a specific attribute's name, data type or length changes
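The last two uses can be sketched over a toy in-memory catalog. Everything here is an assumption for illustration: the dictionary layout, the table and attribute names, and the two helper functions are hypothetical and not how any specific catalog product works internally.

```python
# Toy catalog: (environment, table) -> {attribute name: data type}
catalog = {
    ("dev", "customer"): {"Cust_FN": "varchar(50)", "Cust_AD": "varchar(200)"},
    ("prod", "customer"): {"Cust_FN": "varchar(50)", "Cust_AD": "varchar(200)"},
    ("prod", "orders"): {"Order_ID": "bigint", "Cust_FN": "varchar(50)"},
}


def environments_for(table):
    """List the environments in which a table exists."""
    return sorted(env for (env, name) in catalog if name == table)


def tables_impacted_by(attribute):
    """List (environment, table) pairs that contain the given attribute."""
    return sorted(key for key, columns in catalog.items() if attribute in columns)


print(environments_for("customer"))        # ['dev', 'prod']
print(len(tables_impacted_by("Cust_FN")))  # 3
```

With the metadata in one place, a question like "what breaks if Cust_FN changes length?" becomes a simple lookup instead of a manual search through every environment.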
I recently attended India's first Data Engineering Summit - DES'22 organised by "Analytics India Magazine" & learnt about an exciting concept called "Intelligent Data Catalog". I'll explore this further & write about it in one of my future emails.
I hope you have liked this week's topic. Stay tuned for the next volume!
Hello - Yes, Unity Catalog is similar to Glue Data Catalog; the main advantage is that Unity Catalog can be shared across multiple Databricks workspaces. You need only a single catalog for your dev, test and prod environments, & the metadata is shared between them.
If you are using Databricks on AWS, choose whichever is most suitable for your use case.