Series 2: Ingesting Data into a Data Lake
Guidelines for Ingesting Structured and Unstructured Data into Data Lakes
Ingestion involves connecting to various data sources, extracting the data, detecting changed data and tracking the data flow. Structured, semi-structured and unstructured data is extracted from heterogeneous source systems and initially loaded as-is into the store. The data ingestion layer is the first stop for data arriving from these varied sources into the Data Lake Store.
In simple terms, the data ingestion process is the system by which data flows from its origin to one or more data stores, including data lakes; the destinations may even include search engines and databases.
The quality of the ingestion process largely determines the quality of the data in the data lake.
If data is ingested incorrectly, it becomes harder to analyze downstream, putting the value of your data in jeopardy.
If data is ingested properly, it arrives in the data lake at the right time and with the right fidelity, ready for data wrangling and analytic use.
Data ingestion can happen in real time or in batches. Real-time data ingestion (also called streaming ingestion) for analytical or transactional processing enables businesses to make timely operational decisions that are critical to the success of the organization, while the data is still current.
Real-time data is generally ingested as it originates, for instance via message topics or queues, or through Change Data Capture (CDC).
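To make streaming ingestion concrete, here is a minimal PySpark Structured Streaming sketch that subscribes to a message topic and lands the records as-is in the lake. The broker, topic, and storage paths are hypothetical placeholders, and the job assumes the Spark-Kafka connector is available on the cluster.

```python
# Minimal streaming-ingestion sketch; broker, topic, and paths are
# illustrative placeholders, not a real deployment's values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-ingestion").getOrCreate()

# Subscribe to a message topic; each record is ingested as it originates.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # assumed endpoint
       .option("subscribe", "orders-cdc")                 # assumed topic
       .option("startingOffsets", "latest")
       .load())

# Persist the raw stream to the data lake store as-is; the checkpoint
# directory lets the job resume from the last committed offset on failure.
query = (raw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
         .writeStream
         .format("parquet")
         .option("path", "abfss://raw@datalake.dfs.core.windows.net/orders/")
         .option("checkpointLocation",
                 "abfss://raw@datalake.dfs.core.windows.net/_checkpoints/orders/")
         .trigger(processingTime="1 minute")
         .start())
```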
Data Lake Ingestion Framework (DLIF)
The Data Lake Ingestion Framework accelerates data ingestion and makes the ingested data ready in a form consumable by the transformation and modeling layers, thus enabling the delivery of streamlined and timely analytics.
Key Challenge
A well-designed data lake opens up a plethora of possibilities without the need to iterate:
Running cross-functional data analysis
Building reconciliation systems and reporting systems without impacting the transactional system
Practically infinite compute and storage by virtue of Azure Storage and big data platforms
Unifying data engineering and data science workloads
However, a lot of time is spent engineering the systems required to orchestrate data pulls from sources.
Design Concerns
State Management
Configuration Driven System
Failure Handling
Reconciliation
Data Availability in Consumable Format by Transformation and Data Science Layers
Logging
Recoverability
The time spent building such a system from the ground up is considerable, given these design concerns and the nuances of integration across multiple systems.
Data Lake Ingestion Framework on Azure
Azure Data Factory (ADF)
Azure Databricks / HDInsight
Azure SQL DB
The framework comes with generic, configurable ADF pipelines that, driven by configuration stored in Azure SQL DB, can pull tables from various sources. Pipeline state also sits in Azure SQL DB: state management tracks the checkpoint up to which data has been pulled for each table. This aids incremental data pulls, failure handling and recoverability.
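As an illustration of this checkpoint pattern, the PySpark sketch below reads a configuration row from a control database, pulls only the rows changed since the stored watermark, and lands them in the lake. All table, column, and connection names (IngestionConfig, WatermarkColumn, and so on) are assumptions for illustration, not the framework's actual schema.

```python
# Hedged sketch of a config-driven incremental pull. Control-table and
# column names are hypothetical stand-ins for the Azure SQL DB config.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-pull").getOrCreate()

CONTROL_DB_URL = "jdbc:sqlserver://control-db.database.windows.net;database=dlif"  # assumed

# 1. Read the configuration row for one source table from Azure SQL DB.
config = (spark.read.format("jdbc")
          .option("url", CONTROL_DB_URL)
          .option("dbtable", "dbo.IngestionConfig")  # assumed control table
          .load()
          .where("SourceTable = 'sales.orders'")
          .first())

# 2. Pull only rows changed since the stored checkpoint (the watermark).
incremental_query = (
    f"(SELECT * FROM {config['SourceTable']} "
    f"WHERE {config['WatermarkColumn']} > '{config['Watermark']}') AS delta"
)
delta = (spark.read.format("jdbc")
         .option("url", config["SourceJdbcUrl"])
         .option("dbtable", incremental_query)
         .load())

# 3. Land the delta as-is; a separate step would advance the watermark in
#    Azure SQL DB only after a successful write.
delta.write.mode("append").parquet(config["LandingPath"])
```

Advancing the watermark only after a successful write is what makes the pull recoverable: a failed run simply re-reads from the last committed checkpoint.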
Azure Databricks / HDInsight
Azure Databricks or HDInsight acts as the compute layer that builds the data state from the incremental pulls and exposes:
Current snapshot
Dated partitions
Having these precomputed views aids downstream processes (transformation or data science activities): individual processes no longer need to stitch together incremental data chunks for a given table, and both the current state and the data as of a given date are readily available to query.
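Below is a minimal PySpark sketch of how the two views might be built, assuming the incremental chunks carry a business key (order_id) and a modification timestamp (modified_at); all paths and column names are illustrative.

```python
# Illustrative sketch of building the two consumable views from the
# incremental chunks: dated partitions and a deduplicated current snapshot.
# Paths, key and timestamp columns are assumptions, not the actual schema.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("build-views").getOrCreate()

# Incremental chunks landed by the pull, each tagged with its load date.
delta = (spark.read.parquet("abfss://raw@datalake.dfs.core.windows.net/orders/")
         .withColumn("load_date", F.to_date(F.col("modified_at"))))

# Dated partitions: data as of a date, queryable without stitching chunks.
(delta.write.mode("append")
      .partitionBy("load_date")
      .parquet("abfss://curated@datalake.dfs.core.windows.net/orders_dated/"))

# Current snapshot: keep only the latest version of each business key.
latest = Window.partitionBy("order_id").orderBy(F.col("modified_at").desc())
snapshot = (delta.withColumn("rn", F.row_number().over(latest))
                 .where("rn = 1")
                 .drop("rn"))
(snapshot.write.mode("overwrite")
         .parquet("abfss://curated@datalake.dfs.core.windows.net/orders_current/"))
```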
The framework thus automates a major chunk of the data engineering workload, namely the data pull, enabling the delivery of streamlined and timely analytics.