Apache Spark is a data processing framework that can run processing tasks on very large data sets quickly. It can also distribute those tasks across multiple machines, either on its own or in tandem with other distributed computing tools. Today Spark is used by data engineers and data scientists across the world, in organizations of all sizes, to process both streaming and batch workloads at high speed. Apache Spark can run on YARN, on Kubernetes or in standalone mode, and it works with a wide range of data inputs and outputs.
Apache Spark 3.2 was recently released and is available to anyone who wants to run Spark on Kubernetes. It uses Hadoop 3.3.1 by default, instead of the Hadoop 3.2.0 used previously. Hadoop 3.3.1 brings significant performance improvements and removes several unnecessary calls to the S3 API. This lowers the risk of being throttled and improves Spark's overall performance when reading from S3.
Aggregate push down
Spark applications tend to operate on distributed data coming from disparate data sources. They often have to query sources external to Spark directly, such as data warehouses and backing relational databases. Apache Spark provides the Data Source APIs for that purpose: a pluggable mechanism for accessing structured data through Spark SQL. The Data Source APIs are tightly integrated with the Spark optimizer and support optimizations such as column pruning and pushing filters down to the external data source. These optimizations, however, cover only a subset of the operations that could be pushed down and executed at the data source.
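To make filter push down and column pruning concrete, here is a minimal sketch that does not use Spark at all. An in-memory SQLite database stands in for the external data source, and the table `orders` and the helpers `read_without_pushdown` and `read_with_pushdown` are hypothetical names made up for this illustration. The sketch only compares how many values leave the source in each case.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for an external source (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL, note TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(i, "EU" if i % 2 == 0 else "US", float(i), "n/a") for i in range(1000)],
)


def read_without_pushdown(connection):
    """Fetch every row and column, then filter and project on the client side."""
    rows = connection.execute(
        "SELECT id, region, amount, note FROM orders ORDER BY id"
    ).fetchall()
    values_transferred = sum(len(row) for row in rows)
    result = [(row[0], row[2]) for row in rows if row[1] == "EU"]
    return result, values_transferred


def read_with_pushdown(connection):
    """Let the source apply the filter (WHERE) and the column pruning (SELECT list)."""
    rows = connection.execute(
        "SELECT id, amount FROM orders WHERE region = ? ORDER BY id", ("EU",)
    ).fetchall()
    values_transferred = sum(len(row) for row in rows)
    return rows, values_transferred


plain_result, plain_cost = read_without_pushdown(conn)
pushed_result, pushed_cost = read_with_pushdown(conn)
print(plain_cost, pushed_cost)
```

Both functions return the same rows, but the pushed-down version moves 1,000 values out of the source instead of 4,000, because the source discards the unwanted rows and columns before anything is transferred.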
When reading data from any storage system, Apache Spark uses a library that implements a particular API. A new API called DataSource V2 was released with Spark 2.3 in 2018, and the most common data connectors were ported to it. DataSource V2 allows a high degree of optimization at the data source layer. With Apache Spark 3.2 it becomes possible to benefit from pushdown on queries that select an aggregated column or contain an aggregated filter. For instance, counting the number of rows in a Parquet file becomes faster, because Spark can read the count directly from the file's metadata rather than scanning the file. Aggregate pushdown also helps reduce the amount of data read or shuffled across the network, as Spark can hand the aggregate over to the data source. If a query contains several aggregates and the data source supports them, Spark pushes all of them down.
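The idea of answering an aggregate from file metadata rather than from a full scan can be sketched with a toy model. The `RowGroup` class below is a hypothetical stand-in for a Parquet row group whose footer records a row count and min/max statistics; it is not the Parquet format or a Spark API, and real files may omit such statistics.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Toy row group: footer-style statistics plus the actual column values."""
    num_rows: int
    min_value: int
    max_value: int
    values: List[int]


def make_row_group(values: List[int]) -> RowGroup:
    # Statistics are computed once, at write time, and stored next to the data.
    return RowGroup(len(values), min(values), max(values), list(values))


def aggregate_by_scan(groups: List[RowGroup]) -> Tuple[Tuple[int, int, int], int]:
    """Compute COUNT, MIN and MAX by reading every value."""
    all_values = [value for group in groups for value in group.values]
    result = (len(all_values), min(all_values), max(all_values))
    return result, len(all_values)  # second item: number of values read


def aggregate_from_metadata(groups: List[RowGroup]) -> Tuple[Tuple[int, int, int], int]:
    """Compute COUNT, MIN and MAX from the stored statistics only."""
    count = sum(group.num_rows for group in groups)
    lowest = min(group.min_value for group in groups)
    highest = max(group.max_value for group in groups)
    return (count, lowest, highest), 0  # no data values are read


groups = [
    make_row_group(list(range(0, 100))),
    make_row_group(list(range(100, 250))),
    make_row_group(list(range(250, 300))),
]
scan_result, scan_reads = aggregate_by_scan(groups)
meta_result, meta_reads = aggregate_from_metadata(groups)
print(scan_result, scan_reads, meta_result, meta_reads)
```

Both paths give the same answer, but the metadata path touches three small statistics records instead of 300 values. This shortcut only works for aggregates the statistics can answer, such as COUNT, MIN and MAX in this sketch.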
Aggregate functions are commonly used in SQL to compute a single result from a set of input values. The most frequently used aggregate functions are SUM, MIN, MAX, AVG and COUNT. If the aggregates in a SQL statement are supported by the data source with exactly the same semantics as in Spark, they can be pushed down to the data source level to improve performance. This brings two main benefits:
- Dramatically reduced network IO between Spark and the data source
- Faster aggregate computation in the data source, which can rely on its indexes
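The reduction in network IO can be illustrated with a small sketch, again using an in-memory SQLite database as an assumed stand-in for the external data source rather than Spark itself. The table `sales` and the helpers `aggregate_in_engine` and `aggregate_at_source` are hypothetical names chosen for the example.

```python
import sqlite3
from collections import defaultdict

# In-memory SQLite database used as a stand-in for an external source (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU" if i % 2 == 0 else "US", i) for i in range(1000)],
)


def aggregate_in_engine(connection):
    """No pushdown: pull every row, then compute SUM and COUNT per region locally."""
    rows = connection.execute("SELECT region, amount FROM sales").fetchall()
    totals = defaultdict(lambda: [0, 0])
    for region, amount in rows:
        totals[region][0] += amount
        totals[region][1] += 1
    result = sorted((region, s, c) for region, (s, c) in totals.items())
    return result, len(rows)  # second item: rows transferred from the source


def aggregate_at_source(connection):
    """Pushdown: the source computes the aggregates and returns one row per group."""
    rows = connection.execute(
        "SELECT region, SUM(amount), COUNT(*) FROM sales "
        "GROUP BY region ORDER BY region"
    ).fetchall()
    return rows, len(rows)


local_result, local_rows = aggregate_in_engine(conn)
pushed_result, pushed_rows = aggregate_at_source(conn)
print(local_rows, pushed_rows, pushed_result)
```

The two results are identical, yet the pushed-down query returns 2 rows where the local aggregation had to fetch 1,000. In a real deployment the gap grows with the table size, and the source may additionally use its own indexes to compute the aggregates.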
Data Source API V2 is one of the most important features introduced in Spark 2.3. It helps reduce the amount of data transferred from the source by applying filtering conditions during the load.