As a company recognizes the power of machine learning (ML) and data science, it can explore diverse tools to improve business efficiency, predict changes, and augment customer experiences. To achieve these goals in business-critical use cases, it requires a reliable, consistent pattern for reproducing results, tracking experiments, and deploying ML models into production. This is where Azure Databricks comes in, forming the core of the architecture. The ML lifecycle platform MLflow and the storage layer Delta Lake also play important roles in the process. All of these components integrate seamlessly with services such as Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Data Lake Storage to deliver a capable solution for data science and machine learning. This solution is:
- Simple: An open data lake simplifies the architecture. Delta Lake, the curated layer of the data lake, provides effective access to data in an open-source format.
- Open: This solution supports open frameworks, open standards, and open-source code, reducing the need for future updates. Azure Machine Learning and Azure Databricks natively support Delta Lake and MLflow, components that offer industry-leading ML operations (MLOps), or DevOps for machine learning. A wide range of deployment tools integrates with the solution's standardized model format.
- Collaborative: Data science and ML operations (MLOps) teams work together with this solution. These teams use MLflow Tracking to record and query experiments, and they deploy models to the central MLflow Model Registry. Data engineers then use the deployed models in data ingestion, extract-transform-load (ETL) processes, and streaming pipelines.
The process
- Raw data is prepared, refined, and cleansed using code from multiple languages, frameworks, and libraries, including Python, R, SQL, Spark, pandas, and Koalas.
- Azure Databricks runs data science workloads and builds and trains ML models using pre-installed, optimized libraries.
- MLflow Tracking captures ML experiments, model runs, and results. Its integration with Azure Databricks provides an easy way to track experiments, store models in repositories, and make models available to other services. When the best model is ready for production, Azure Databricks deploys it to the MLflow Model Registry, a centralized registry that stores information on production models and makes them available to other components:
- Python and Spark pipelines can consume the models in streaming ETL processes, batch workloads, and data ingestion. REST APIs offer access to the models for purposes such as testing and interactive scoring in mobile and web applications.
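The data-preparation step above can be sketched with pandas. This is a hedged illustration only: the column names (`customer_id`, `amount`, `country`) and cleansing rules are assumptions, not part of the reference architecture.

```python
# Illustrative cleansing of raw records with pandas.
# Column names and rules are assumptions for the sketch.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "amount": ["10.5", "20.0", "20.0", None, "bad"],
    "country": [" US", "us", "us", "DE", "DE"],
})

cleaned = (
    raw.drop_duplicates(subset="customer_id")  # remove duplicate customers
       .assign(
           # coerce malformed amounts to NaN, normalize country codes
           amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"),
           country=lambda d: d["country"].str.strip().str.upper(),
       )
       .dropna(subset=["amount"])              # drop rows with unusable amounts
       .reset_index(drop=True)
)
print(cleaned)
```

The same transformations translate directly to Spark DataFrames for larger-than-memory data, which is the typical path on Azure Databricks.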
At its core, the Azure Databricks architecture offers massive processing power for data science and ML workloads, along with native integration with Azure data services and ML runtime environments.
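The REST scoring path described in the process above can be sketched as follows. The endpoint URL and token are hypothetical placeholders, so the actual HTTP call is shown commented out; the payload uses the `dataframe_split` JSON format that MLflow 2.x model-serving endpoints accept, and the feature names are assumptions.

```python
# Sketch of calling a deployed model's REST scoring endpoint.
# Endpoint URL, token, model name, and feature names are placeholders.
import json

# MLflow 2.x serving endpoints accept the "dataframe_split" payload format.
payload = {
    "dataframe_split": {
        "columns": ["feature_1", "feature_2"],
        "data": [[0.5, 1.2], [0.1, 3.4]],
    }
}
body = json.dumps(payload)

# A web or mobile application would POST this body to the endpoint, e.g.:
# import requests
# response = requests.post(
#     "https://<workspace>/model/churn-classifier/1/invocations",  # placeholder
#     headers={
#         "Authorization": "Bearer <token>",  # placeholder
#         "Content-Type": "application/json",
#     },
#     data=body,
# )
# predictions = response.json()
print(body)
```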