Apache Spark is an open-source analytics engine built for big data workloads. It handles both batch and real-time analytics and data processing. Spark offers native bindings for R, Scala, Python, and Java, and ships with libraries for building stream processing, graph processing, and machine learning applications.
Databricks is essentially a Unified Analytics Platform built on top of Apache Spark that accelerates innovation by unifying business, engineering, and data science. It provides an integrated workspace for visualization and exploration, where users can work, collaborate, and learn in a single, easy-to-use environment.
Introducing Apache Spark™ 3.2
Recently, the availability of Apache Spark™ 3.2 on Databricks was announced as part of Databricks Runtime 10.0. Over the years, Spark has emerged as one of the most widely used engines for ML, data science, and data engineering on single-node machines and clusters. With managed Spark clusters in the cloud, provisioning a cluster takes just a couple of clicks.
Apache Spark™ 3.2 aims to make Spark more scalable, swift, and unified than before, extending the engine with several new features, including:
- A pandas API, introduced in Apache Spark™ 3.2 to unify the big data and small data APIs
- An ANSI SQL compatibility mode that simplifies migration of SQL workloads
Speeding up Spark SQL at runtime
Adaptive Query Execution (AQE) is enabled by default in Apache Spark™ 3.2. AQE re-optimizes query execution plans for better performance, based on accurate statistics collected at runtime.
Pre-collecting and maintaining statistics can be expensive in big data, and inaccurate statistics can lead to poor plans no matter how advanced the optimizer is. AQE complements existing query optimization techniques such as Dynamic Partition Pruning, and can coalesce shuffle partitions and re-optimize join strategies at runtime.
More scalable state processing in Structured Streaming
The default state store implementation in Structured Streaming is not scalable, because it is limited by the heap size of the executors. Apache Spark™ 3.2 adds a RocksDB-based state store implementation. It has been used in Databricks production for several years and can serve state from disk without depending on executor heap size.
Beyond these new features, the Apache Spark™ 3.2 release focuses heavily on refinement, usability, and stability.