Vol #2 | What is a Data Catalog

and how can it help you discover data?

May 06, 2022

Photo of a library with its lights turned on. Photo by 🇸🇮 Janko Ferlič on Unsplash

A Data Catalog is where you store the metadata of your data platform.

It holds metadata about the tables in your data ecosystem: what they are called, which attributes they contain & where the data lives.

It can also hold additional information, such as:

Tags on tables or attributes, e.g. Production or Dev tables, or attributes classified as sensitive.
Business meaning of attributes, e.g. Cust_FN means Customer First Name and Cust_AD means Customer Address.
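To make this concrete, one catalog entry can be pictured as a small record per table. The sketch below is plain Python and not the API of any real catalog product. The class names (TableEntry, ColumnEntry), the field names, the Cust_ID column and the S3 path are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnEntry:
    name: str                    # physical attribute name, e.g. Cust_FN
    data_type: str
    business_name: str = ""      # business meaning of the attribute
    tags: set = field(default_factory=set)


@dataclass
class TableEntry:
    name: str
    location: str                # where the data lives (hypothetical path)
    tags: set = field(default_factory=set)
    columns: list = field(default_factory=list)

    def sensitive_columns(self):
        """Return the names of columns tagged as sensitive."""
        return [c.name for c in self.columns if "sensitive" in c.tags]


customer = TableEntry(
    name="customer",
    location="s3://example-bucket/customer/",
    tags={"production"},
    columns=[
        ColumnEntry("Cust_FN", "string", "Customer First Name", {"sensitive"}),
        ColumnEntry("Cust_AD", "string", "Customer Address", {"sensitive"}),
        ColumnEntry("Cust_ID", "bigint", "Customer Identifier"),
    ],
)

print(customer.sensitive_columns())
```

Real catalogs store far more than this, but the idea is the same: technical metadata, tags and business meaning sit together in one place.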

Why do we need a Data Catalog?

A catalog is essential for discovering the data present in the system.

It helps users understand what data is available & query the attributes they are interested in.
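Discovery can be as simple as a keyword search over attribute names and their business meanings. Here is a minimal sketch, assuming a tiny in-memory catalog. The dictionary layout, the orders table and the find_attributes helper are hypothetical and only meant to show the idea.

```python
# Hypothetical in-memory catalog: table name -> {attribute name: business meaning}
catalog = {
    "customer": {"Cust_FN": "Customer First Name", "Cust_AD": "Customer Address"},
    "orders": {"Ord_ID": "Order Identifier", "Cust_ID": "Customer Identifier"},
}


def find_attributes(catalog, keyword):
    """Return (table, attribute) pairs whose name or meaning mentions the keyword."""
    keyword = keyword.lower()
    return [
        (table, column)
        for table, columns in catalog.items()
        for column, meaning in columns.items()
        if keyword in column.lower() or keyword in meaning.lower()
    ]


print(find_attributes(catalog, "address"))
```

A user who has never seen the Cust_AD column can still find it by searching for "address", because the business meaning is part of the catalog.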

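Without metadata, you cannot query your data using SQL-based tools like Athena or Presto, because these engines need a table definition (columns, data types & file location) to know how to read the files.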

How to implement a Data Catalog?

Files in a data lake such as Amazon S3 can have their metadata created in the AWS Glue Data Catalog.

You can use AWS Glue crawlers to crawl the S3 files, extract their metadata & store it in the Catalog.
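A crawler can be set up from code as well as from the console. The sketch below assumes boto3 is installed, AWS credentials are configured, and an IAM role with Glue and S3 access already exists. The crawler name, role ARN, database name and bucket path are placeholders, not values from a real account.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for Glue's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def run_crawler(request):
    """Create the crawler and start it. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the helper above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)
# run_crawler(request)  # uncomment to run against a real AWS account
```

Once the crawler finishes, the tables it discovered should show up in the chosen catalog database.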

Once the metadata is stored in the Catalog, the data can be easily queried using services like Amazon Athena.
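Querying a cataloged table from code might look like the sketch below. It assumes boto3, configured AWS credentials and an S3 location where Athena may write results. The database, table and bucket names are placeholders.

```python
def build_query_request(sql, database, output_location):
    """Build the keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_query(request):
    """Submit the query and return its execution id. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the helper above works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


query = build_query_request(
    sql="SELECT * FROM sales LIMIT 10",
    database="sales_db",
    output_location="s3://example-bucket/athena-results/",
)
# execution_id = start_query(query)  # uncomment to run against a real AWS account
```

Athena runs queries asynchronously, so the returned execution id is what you would use later to check the status and fetch the results.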

What has changed in the Data Catalog space over the last few years?

A lot of new features are being added to catalogs. A Data Catalog is no longer just a system to store metadata.

A Data Catalog is now also being used:

to understand lineage (how data flows from source to target)
to classify sensitive & PII data
to find out whether the same data is duplicated anywhere in the system
to find the environments (dev/test/prod) in which a particular table exists
to see how many tables are affected if a specific attribute's name, data type or length changes
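The last two uses in the list above boil down to simple lookups over catalog metadata. The sketch below is plain Python over made-up rows. The row layout, the table names and the two helper functions are hypothetical, not the API of any catalog product.

```python
# Hypothetical catalog rows: (environment, table, attribute, data type)
rows = [
    ("dev", "customer", "Cust_FN", "varchar(50)"),
    ("prod", "customer", "Cust_FN", "varchar(50)"),
    ("prod", "orders", "Cust_ID", "bigint"),
    ("prod", "invoices", "Cust_ID", "bigint"),
    ("test", "orders", "Cust_ID", "bigint"),
]


def environments_for(rows, table):
    """Return the environments in which the given table exists."""
    return sorted({env for env, t, _, _ in rows if t == table})


def tables_using(rows, attribute):
    """Return (environment, table) pairs that would be affected if the attribute changes."""
    return sorted({(env, t) for env, t, a, _ in rows if a == attribute})


print(environments_for(rows, "customer"))
print(tables_using(rows, "Cust_ID"))
```

Here a change to Cust_ID would touch three tables across prod and test, which is the kind of answer an impact analysis is after.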

I recently attended India's first Data Engineering Summit - DES'22 organised by "Analytics India Magazine" & learnt about an exciting concept called "Intelligent Data Catalog". I'll explore this further & write about it in one of my future emails.

I hope you have liked this week's topic. Stay tuned for the next volume!

Data & Cloud by GT