Apache Spark is an open-source, distributed processing system used for big data workloads. It uses in-memory caching and optimized query execution to run fast queries against data of any size. Spark keeps working data in memory (RAM) wherever possible, which makes processing much faster than repeatedly reading from and writing to disk.
It can also be used for running distributed SQL, creating data pipelines, ingesting data into a database, running machine learning algorithms, and working with graphs or data streams.
The Apache Spark 3.0 runtime is now available in Azure Synapse. Building on Microsoft-specific enhancements and existing open-source work, this version includes several unique improvements. Combined, these enhancements deliver significantly faster processing than open-source Spark 3.0.2 and 2.4:
- Performance Improvements: When evaluating performance in large-scale distributed systems, ‘doing the same with less’ or ‘doing more with the same’ are always the prime measures. In addition to the various Azure Synapse performance improvements, Spark 3 brings new enhancements that boost performance and allow engineering teams to get more out of their work.
Predicate pushdown and efficient shuffle management build on common performance patterns and optimizations that are typically included in new releases. Azure Synapse-specific optimizations in these areas have been ported over to augment the enhancements that come with Spark 3.
- Adaptive Query Execution (AQE): Data processing jobs run on data-intensive platforms like Apache Spark differ from those on traditional systems such as relational databases in the volume of data they handle and, as a result, in how long they run. It is common for queries or data processing steps to take hours or even days to run in Spark, depending on the volume of data. A plan chosen before execution from estimates alone can therefore be costly when those estimates turn out to be wrong.
Over the course of such a long-running job, the shape of the query plan might change as estimates of data volume, cardinality, skew, and more are replaced with actual measurements. Adaptive Query Execution (AQE) in Azure Synapse provides a framework for dynamic optimization that can bring major performance improvements to Spark workloads, while saving data and performance engineering teams valuable time by automating manual tasks.
AQE helps with:
- Shuffle partition tuning: This is a key source of manual work that data teams have to deal with today. AQE adjusts the number of shuffle partitions at runtime based on the actual size of the shuffled data.
- Join strategy optimization: Without AQE, this needs human review and in-depth knowledge of query optimization to tune the types of joins used. AQE selects the join type based on actual data rather than estimates.
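The two decisions above can be illustrated with a minimal, Spark-free sketch. It is not Spark's or Azure Synapse's implementation. The function names and the greedy grouping rule are assumptions made for illustration, and the two thresholds are illustrative values. In open-source Spark the comparable settings are `spark.sql.adaptive.advisoryPartitionSizeInBytes` and `spark.sql.autoBroadcastJoinThreshold`, with AQE itself toggled by `spark.sql.adaptive.enabled`.

```python
from typing import List

MB = 1024 * 1024

# Illustrative thresholds (assumptions for this sketch, not Synapse settings).
TARGET_PARTITION_BYTES = 64 * MB
BROADCAST_THRESHOLD_BYTES = 10 * MB


def coalesce_partitions(sizes: List[int],
                        target: int = TARGET_PARTITION_BYTES) -> List[List[int]]:
    """Greedily merge contiguous shuffle partitions until adding the next
    one would push the group past the target size.

    `sizes` holds the measured byte size of each shuffle partition, which is
    only known after the map stage has actually run.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for index, size in enumerate(sizes):
        # Close the current group if this partition would overflow it.
        if current and current_bytes + size > target:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


def choose_join(left_bytes: int, right_bytes: int,
                threshold: int = BROADCAST_THRESHOLD_BYTES) -> str:
    """Pick a broadcast join when the smaller side, as measured at runtime,
    fits under the broadcast threshold; otherwise fall back to sort-merge."""
    if min(left_bytes, right_bytes) <= threshold:
        return "broadcast_hash_join"
    return "sort_merge_join"
```

For example, `coalesce_partitions([10 * MB, 20 * MB, 30 * MB, 40 * MB, 5 * MB])` merges the first three partitions into one group and the last two into another, so five small shuffle partitions become two reduce tasks. Both functions take measured sizes as input, so the decisions rest on actual data rather than estimates made before the job runs.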
- Dynamic partition pruning:
Skipping the reading of partitions that a query does not need is one of the most common optimizations in high-scale query processors. However, not all partition elimination can be done during query optimization. Some of it can only happen at execution time, once the values that decide which partitions are needed are known. This feature is so vital to performance that a version of it has been added to the Apache Spark 2.4 codebase used in Azure Synapse. It is also built into the Spark 3.0 runtime now available in Azure Synapse.
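As a rough illustration of why some pruning has to wait until execution time, the Spark-free sketch below joins a partitioned fact table to a filtered dimension table. The table contents and function names are hypothetical, and the logic is a simplification rather than Spark's implementation. In open-source Spark 3.0 the feature is controlled by `spark.sql.optimizer.dynamicPartitionPruning.enabled`.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical fact table, partitioned by month: partition key -> (month, amount) rows.
SALES_PARTITIONS: Dict[str, List[Tuple[str, int]]] = {
    "2021-01": [("2021-01", 10), ("2021-01", 20)],
    "2021-02": [("2021-02", 5)],
    "2021-03": [("2021-03", 7), ("2021-03", 3)],
}

# Hypothetical dimension table: (month, has_promotion).
MONTHS_DIM: List[Tuple[str, bool]] = [
    ("2021-01", False),
    ("2021-02", True),
    ("2021-03", True),
]


def keys_matching_filter(dim: List[Tuple[str, bool]], promo: bool) -> Set[str]:
    """Evaluate the dimension-side filter. Its result is only known at run time,
    so the optimizer cannot prune fact partitions while planning."""
    return {month for month, has_promo in dim if has_promo == promo}


def join_with_pruning(partitions: Dict[str, List[Tuple[str, int]]],
                      dim: List[Tuple[str, bool]],
                      promo: bool) -> Tuple[int, List[str]]:
    """Sum amounts for months passing the dimension filter, scanning only the
    fact partitions whose key appears in the filter result."""
    wanted = keys_matching_filter(dim, promo)
    scanned: List[str] = []
    total = 0
    for key, rows in partitions.items():
        if key not in wanted:
            continue  # partition pruned: its rows are never read
        scanned.append(key)
        total += sum(amount for _, amount in rows)
    return total, scanned
```

Calling `join_with_pruning(SALES_PARTITIONS, MONTHS_DIM, True)` reads only the two partitions for months with a promotion and never touches `2021-01`. The set of partitions to skip depends on the dimension rows, which is why this elimination cannot be settled when the query is planned.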