Vol #6 | What is Data Ingestion
Hello everyone! Welcome to Volume #6 of “Data & Cloud”
I’ve been writing and publishing this newsletter every week for the last 6 weeks now! Thanks to all my subscribers and readers for your likes and comments.
Today we will learn about "Data Ingestion", which is one of the most common activities in any data project.
What is Data Ingestion?
Data Ingestion is the process of extracting data from source systems and loading it into the landing or raw layer of the data lake.
In traditional DWH projects, this was also referred to as the data extraction process. It was part of the standard ETL process, where the 'E' referred to extracting data from various source systems and loading it into the staging area of the data warehouse.
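To make that 'E' step concrete, here is a minimal sketch in Python, assuming a relational source reachable via SQLAlchemy and a local landing folder; the connection string, table name, and paths are placeholders, not a prescription, and writing Parquet assumes pyarrow is installed.

```python
# Minimal sketch: extract one table from a source RDBMS and land it,
# unchanged, in the landing/raw layer as a dated Parquet file.
# The connection string, table name, and landing path are placeholders.
from datetime import date
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

source_engine = create_engine("postgresql://user:pass@source-db:5432/sales")  # hypothetical source

def ingest_table(table_name: str, landing_root: str) -> Path:
    """Extract a full table and write it to the landing/raw layer."""
    df = pd.read_sql(f"SELECT * FROM {table_name}", source_engine)
    target_dir = Path(landing_root) / table_name / f"load_date={date.today():%Y-%m-%d}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target_file = target_dir / "data.parquet"
    df.to_parquet(target_file, index=False)  # the raw layer keeps source data as-is
    return target_file

# Example usage: ingest_table("orders", "/data/lake/raw")
```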
What are the sources for extracting data?
There are various source systems from which data can be extracted and ingested into the data lake.
Structured Data Sources: Sources like RDBMS systems or CSV files that hold data in a structured format. In traditional DWH projects, these were the only source systems data was extracted from.
Semi-Structured Data Sources: Sources that send XML or JSON files with semi-structured data.
Unstructured Data Sources: Sources like social media feeds with user comments or tweets that have no fixed structure, e.g. ingesting Twitter feed data into the data lake. (A short read-time sketch follows this list.)
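As an illustration, the small Python sketch below shows what each flavour looks like at read time; the file names are made up, and only the parsing pattern matters.

```python
# Reading the three flavours of source data (file names are placeholders).
import json

import pandas as pd

# Structured: a tabular CSV extract with a fixed schema.
orders = pd.read_csv("orders_2024-01-01.csv")

# Semi-structured: nested JSON, flattened into columns for later processing.
with open("customer_events.json") as fh:
    events = pd.json_normalize(json.load(fh))

# Unstructured: free text such as tweets or user comments, kept as raw strings.
with open("tweets.txt") as fh:
    tweets = fh.read().splitlines()
```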
What is the frequency of data ingestion?
Data ingestion can happen for batch data as well as streaming data. The frequency should be decided based on your use cases.
Batch: Sources that send data in a batch once a day, e.g. an OLTP system that sends End-of-Day (EoD) files. Most traditional ETL systems ran EoD batches at midnight.
Micro-batches: Sources that send data in multiple batches throughout the day, e.g. at 15-minute or hourly intervals.
Streaming: Sources that send data continuously as streaming messages, such as IoT sensors in wearables or aeroplanes. (See the sketch after this list.)
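The sketch below contrasts these frequencies using PySpark as one reasonable choice of engine; the paths, Kafka broker, and topic name are placeholders, and the Kafka source assumes the spark-sql-kafka connector package is available.

```python
# Batch vs. micro-batch/streaming ingestion, sketched with PySpark.
# Paths, broker address, and topic name are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-frequency").getOrCreate()

# Batch: read the End-of-Day extract once per run and append it to the raw layer.
eod_df = spark.read.csv("/landing/eod/2024-01-01/", header=True)
eod_df.write.mode("append").parquet("/datalake/raw/orders/")

# Micro-batch / streaming: continuously ingest messages from a Kafka topic.
sensor_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "iot-sensor-readings")
    .load()
)
query = (
    sensor_stream.writeStream.format("parquet")
    .option("path", "/datalake/raw/sensor_readings/")
    .option("checkpointLocation", "/datalake/_checkpoints/sensor_readings/")
    .trigger(processingTime="15 minutes")  # 15-minute micro-batches; shorten or omit for near real-time
    .start()
)
query.awaitTermination()  # block while the stream runs
```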
What are the key considerations for data ingestion?
When we design and implement a data ingestion process, there are various factors that need to be considered. Some of the key considerations are listed below.
Full data extraction: Do we need to extract the complete data set on every run, or is there a way to identify only the incremental data in the source system?
Incremental/Delta Changes: Can we identify the delta changes since the previously extracted batch and ingest only the data that has changed? (A watermark-based sketch follows this list.)
Impact on source system: Does the source system's performance get impacted while we are extracting data from it?
Multiple threads for extraction: Can we run multiple parallel threads/processes to extract data from the source system and improve extraction performance?
Extract data in chunks/parts: Is there a way to extract data from source tables in chunks? E.g. can we extract data on the basis of partitions in the source table?
Performance: What are the various ways to improve the performance of the data extraction/ingestion process?
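One common answer to the incremental/delta question is a watermark: remember the highest change timestamp already loaded and only pull rows beyond it. Here is a minimal sketch, assuming each source table has a last_updated column and that keeping the watermark in a local JSON file is acceptable; the connection string, table names, and state file are illustrative.

```python
# Watermark-based incremental extraction (a sketch, not a full framework).
# Assumes each source table has a last_updated timestamp column.
import json
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@source-db:5432/sales")  # hypothetical source
STATE_FILE = Path("watermarks.json")  # stores the last loaded timestamp per table

def load_watermark(table: str) -> str:
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    return state.get(table, "1970-01-01 00:00:00")

def save_watermark(table: str, value: str) -> None:
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[table] = value
    STATE_FILE.write_text(json.dumps(state))

def extract_delta(table: str) -> pd.DataFrame:
    """Fetch only the rows changed since the previous run."""
    watermark = load_watermark(table)
    query = text(f"SELECT * FROM {table} WHERE last_updated > :wm")
    df = pd.read_sql(query, engine, params={"wm": watermark})
    if not df.empty:
        save_watermark(table, str(df["last_updated"].max()))
    return df

# Example usage: changed_orders = extract_delta("orders")
```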
What are the modern tools for data ingestion?
As part of the modern data stack, there are multiple tools that can be used for data ingestion. These tools should support the key features below.
Connectivity: Easy connectivity (built-in connectors) to various source systems such as SAP, Salesforce, Zoho, Shopify, RDBMS, files, cloud storage, streaming sources, IoT sensors, etc.
Identify incremental data: Ability to identify the incremental data to be extracted, so delta records can be fetched without custom code for each table/entity.
Scheduling and Orchestration: Should be able to schedule batches to extract data at regular intervals as per the business needs. (A scheduling sketch follows this list.)
Continuous Sync: Should be able to perform continuous replication of data from source to target systems in real time or near real time.
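For the scheduling and orchestration point, here is a minimal sketch using Apache Airflow as one example orchestrator; the DAG id, schedule, table list, and the imported ingest_table helper (wrapping the earlier extraction sketch) are assumptions for illustration.

```python
# Scheduling a daily ingestion batch with Apache Airflow.
# The ingestion helper module and table list are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from my_ingestion_lib import ingest_table  # hypothetical module wrapping the earlier sketch

with DAG(
    dag_id="daily_raw_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # run the batch once a day (Airflow 2.4+; older versions use schedule_interval)
    catchup=False,
) as dag:
    # One extraction task per source table, all driven by the same callable.
    for table in ["orders", "customers", "payments"]:
        PythonOperator(
            task_id=f"ingest_{table}",
            python_callable=ingest_table,
            op_kwargs={"table_name": table, "landing_root": "/data/lake/raw"},
        )
```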
Data ingestion is one of the foundational activities when starting any data project. It creates the first version of the landing/raw (bronze) layer in your data lake and often gives an early indication of the nature of the data that needs to be further processed and modelled.
You can either use one of the modern data stack ETL/ELT tools to perform ingestion or build your own custom framework on big data technologies like Spark. In either case, make sure the platform is scalable and extensible for future data sources, so that you don't need to write a new program for every new source table (a small metadata-driven sketch follows).
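As a sketch of what "no new program per table" can look like, here is a tiny metadata-driven loop where onboarding a source becomes a config change; the config entries and the do-nothing extract/land stubs are placeholders standing in for real connectors and writers.

```python
# Metadata-driven ingestion: onboarding a new table = adding a config entry.
# The extract/land functions are stubs standing in for real connectors/writers.
import json

CONFIG = json.loads("""
{
  "sources": [
    {"table": "orders",    "mode": "incremental", "watermark_column": "last_updated"},
    {"table": "customers", "mode": "full"}
  ]
}
""")

def extract(table, mode, watermark_column=None):
    # Placeholder for a real extractor (JDBC query, API call, file copy, ...).
    print(f"extracting {table} ({mode}, watermark={watermark_column})")
    return []

def land_to_raw(rows, table):
    # Placeholder for writing the extracted rows to the landing/raw layer.
    print(f"landing {len(rows)} rows for {table}")

for source in CONFIG["sources"]:
    rows = extract(source["table"], source["mode"], source.get("watermark_column"))
    land_to_raw(rows, source["table"])
```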
Hope you found this article helpful. Please comment & share with your friends.
Happy Learning!