Series 2: Ingesting Data into a Data Lake
Guidelines for Ingesting Structured and Unstructured Data into Data Lakes
Ingestion involves connecting to various data sources, extracting the data, detecting changed data and tracking the data flow. Structured, semi-structured and unstructured data is extracted from heterogeneous source systems and initially loaded as-is into the store. The data ingestion layer is the first stop for data arriving from these varied sources into the Data Lake Store.
In simple terms, the data ingestion process is the system by which data flows from its origin to one or more data stores, including data lakes; the destinations may even include search engines and databases.
The quality of the ingestion process largely determines the quality of the data in the data lake.
If data is ingested incorrectly, it becomes harder to analyze downstream, putting the value of your data in jeopardy.
If data is ingested properly, it arrives in the data lake at the right time and with the right fidelity, ready for data wrangling and analytic use.
Data ingestion can happen in real time or in batches. Real-time data ingestion (also called streaming ingestion) for analytical or transactional processing enables businesses to make timely operational decisions that are critical to the success of the organization, while the data is still current.
Real-time data is generally ingested as it originates, for instance via message topics or queues, or through Change Data Capture (CDC).
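To make streaming ingestion concrete, here is a minimal PySpark Structured Streaming sketch that subscribes to a message topic and lands the records as-is in the lake. The broker, topic, and storage paths are hypothetical placeholders, and the job assumes the Spark-Kafka connector is available on the cluster.

```python
# Minimal streaming-ingestion sketch; broker, topic, and paths are
# illustrative placeholders, not a real deployment's values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-ingestion").getOrCreate()

# Subscribe to a message topic; each record is ingested as it originates.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # assumed endpoint
       .option("subscribe", "orders-cdc")                 # assumed topic
       .option("startingOffsets", "latest")
       .load())

# Persist the raw stream to the data lake store as-is; the checkpoint
# directory lets the job resume from the last committed offset on failure.
query = (raw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
         .writeStream
         .format("parquet")
         .option("path", "abfss://raw@datalake.dfs.core.windows.net/orders/")
         .option("checkpointLocation",
                 "abfss://raw@datalake.dfs.core.windows.net/_checkpoints/orders/")
         .trigger(processingTime="1 minute")
         .start())
```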
Data Lake Ingestion Framework (DLIF)
The Data Lake Ingestion Framework accelerates data ingestion and makes the ingested data ready in a form consumable by the transformation and modeling layers, thus enabling the delivery of streamlined and timely analytics.
Key Challenge
A well-designed data lake opens up a plethora of possibilities without the need to iterate:
Running cross-functional data analysis
Building reconciliation systems and reporting systems without impacting the transactional system
Practically infinite compute and storage by virtue of Azure Storage and big data platforms
Unifying data engineering and data science workloads
However, a lot of time is spent engineering the systems required to orchestrate data pulls from sources.
Design Concerns
State Management
Configuration Driven System
Failure Handling
Reconciliation
Data Availability in Consumable Format by Transformation and Data Science Layers
Logging
Recoverability
The time spent building such a system from the ground up is considerable, given these design concerns and the nuances of integration across multiple systems.
Data Lake Ingestion Framework on Azure
Azure Data Factory (ADF)
Azure Databricks / HDInsight
Azure SQL DB
The framework comes with generic, configurable ADF pipelines that, driven by configuration stored in Azure SQL DB, can pull tables from various sources. Pipeline state also sits in Azure SQL DB: state management tracks the checkpoint up to which data has been pulled for each table. This aids incremental data pulls, failure handling and recoverability.
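As an illustration of this checkpoint pattern, the PySpark sketch below reads a configuration row from a control database, pulls only the rows changed since the stored watermark, and lands them in the lake. All table, column, and connection names (IngestionConfig, WatermarkColumn, and so on) are assumptions for illustration, not the framework's actual schema.

```python
# Hedged sketch of a config-driven incremental pull. Control-table and
# column names are hypothetical stand-ins for the Azure SQL DB config.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-pull").getOrCreate()

CONTROL_DB_URL = "jdbc:sqlserver://control-db.database.windows.net;database=dlif"  # assumed

# 1. Read the configuration row for one source table from Azure SQL DB.
config = (spark.read.format("jdbc")
          .option("url", CONTROL_DB_URL)
          .option("dbtable", "dbo.IngestionConfig")  # assumed control table
          .load()
          .where("SourceTable = 'sales.orders'")
          .first())

# 2. Pull only rows changed since the stored checkpoint (the watermark).
incremental_query = (
    f"(SELECT * FROM {config['SourceTable']} "
    f"WHERE {config['WatermarkColumn']} > '{config['Watermark']}') AS delta"
)
delta = (spark.read.format("jdbc")
         .option("url", config["SourceJdbcUrl"])
         .option("dbtable", incremental_query)
         .load())

# 3. Land the delta as-is; a separate step would advance the watermark in
#    Azure SQL DB only after a successful write.
delta.write.mode("append").parquet(config["LandingPath"])
```

Advancing the watermark only after a successful write is what makes the pull recoverable: a failed run simply re-reads from the last committed checkpoint.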
Azure Databricks / HDInsight
Azure Databricks or HDInsight acts as the compute layer that builds the data state from the incremental pulls and exposes:
Current snapshot
Dated partitions
Having these precomputed views aids downstream processes (transformation or data science activities): individual processes no longer need to stitch together incremental data chunks for a given table, and both the current state and the data as of a given date are readily available to query.
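Below is a minimal PySpark sketch of how the two views might be built, assuming the incremental chunks carry a business key (order_id) and a modification timestamp (modified_at); all paths and column names are illustrative.

```python
# Illustrative sketch of building the two consumable views from the
# incremental chunks: dated partitions and a deduplicated current snapshot.
# Paths, key and timestamp columns are assumptions, not the actual schema.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("build-views").getOrCreate()

# Incremental chunks landed by the pull, each tagged with its load date.
delta = (spark.read.parquet("abfss://raw@datalake.dfs.core.windows.net/orders/")
         .withColumn("load_date", F.to_date(F.col("modified_at"))))

# Dated partitions: data as of a date, queryable without stitching chunks.
(delta.write.mode("append")
      .partitionBy("load_date")
      .parquet("abfss://curated@datalake.dfs.core.windows.net/orders_dated/"))

# Current snapshot: keep only the latest version of each business key.
latest = Window.partitionBy("order_id").orderBy(F.col("modified_at").desc())
snapshot = (delta.withColumn("rn", F.row_number().over(latest))
                 .where("rn = 1")
                 .drop("rn"))
(snapshot.write.mode("overwrite")
         .parquet("abfss://curated@datalake.dfs.core.windows.net/orders_current/"))
```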
The framework thus automates a major chunk of the data engineering workload, namely the data pull, enabling the delivery of streamlined and timely analytics.