In June 2020, Apache Spark released a major new version, Spark 3.0. It builds on the previous 2.x line and comes loaded with new features that improve functionality, fix more bugs, and, most importantly, deliver better performance.
Today, Apache Spark is one of the most essential technologies in the Big Data domain. The number of applications running Spark in production has risen sharply because it offers a much faster, unified processing engine for huge volumes of data. Big Data is an exciting field with endless possibilities; take up Spark Training to master Big Data processing and extract the insights businesses need to fuel their growth in this competitive market.
There are a host of new features in Apache Spark 3.0. We will discuss some of the most exciting ones here. Read on to find out more.
Enhancements to Pandas UDF (User-Defined Function) API
The addition of Pandas UDFs (User-Defined Functions) is considered one of the best features added since Spark 2.3, because it lets users take advantage of the Pandas API within Spark. Spark 3.0 introduces a redesigned Pandas UDF interface based on Python type hints. In earlier versions, the different UDF types were neither uniform nor simple to follow, which caused users a lot of confusion.

The new interface, together with the other improvements in this release, should eliminate much of that lingering confusion among developers working in Spark.
As of now, four different cases of Pandas UDFs are supported:
- Series -> Series
- Iterator of multiple Series -> Iterator of Series
- Series -> Scalar
- Iterator of Series -> Iterator of Series
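As a sketch of the new type-hint style, the functions below show the shape of three of these cases. They are written with plain pandas so they run without a cluster; in Spark 3.0 each would additionally be registered with the `pandas_udf` decorator from `pyspark.sql.functions`, and the sample values here are purely illustrative:

```python
from typing import Iterator

import pandas as pd

# Series -> Series: in Spark this would carry @pandas_udf("long");
# the UDF type is now inferred from the Python type hints.
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

# Iterator of Series -> Iterator of Series: useful when expensive
# state (e.g. a loaded model) should be set up once per partition,
# before iterating over the incoming batches.
def plus_two(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for batch in batches:
        yield batch + 2

# Series -> Scalar: the grouped-aggregate variant, reducing a
# column of values to a single result.
def mean_udf(s: pd.Series) -> float:
    return float(s.mean())

s = pd.Series([1, 2, 3])
print(plus_one(s).tolist())              # [2, 3, 4]
print(next(plus_two(iter([s]))).tolist())  # [3, 4, 5]
print(mean_udf(s))                       # 2.0
```

The advantage of this style is that the same function signature documents both the input and output contract, instead of relying on a separately supplied UDF type constant.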
Many consider this a good starting point, but there is a long way to go from here: the set of supported type hints is still very limited, and expanding it will need more community support.
Enhancements in Adaptive Query Execution (AQE)
For Spark to run effectively, runtime adaptivity is significant, as execution plans can then be optimized based on the actual input data. An important thing to note here is that the data strongly affects the overall effectiveness of an application. In the new version, two improvements to AQE help tune Spark even further at runtime:
- AQE coalesces small partitions, so users no longer need to worry about the number of shuffle partitions; it is adjusted dynamically at runtime.
- Once data skew is detected, AQE splits the skewed partitions into smaller ones.
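These behaviors are switched on through configuration. A minimal sketch of the relevant `spark-defaults.conf` entries in Spark 3.0 (the values shown are illustrative, not tuning advice):

```
spark.sql.adaptive.enabled                     true
spark.sql.adaptive.coalescePartitions.enabled  true
spark.sql.adaptive.skewJoin.enabled            true
```

The first key enables AQE as a whole; the other two control partition coalescing and skew-join handling, respectively.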
Structured Streaming gets a dedicated user interface
The Web UI in the new version of Spark now arrives with an additional tab for Structured Streaming, which simplifies monitoring streaming jobs.
At present, the statistics page for each distinct streaming query contains five different metrics:
Input Rate, Process Rate, Input Rows, Batch Duration, and Operation Duration.
Many new built-in functions were added
The latest version of Apache Spark arrives with lots of new built-in functions, including hyperbolic functions, CSV operations, bit counts, and date, interval, and timestamp helpers. More than 30 functions were added in this release. You can learn more about Spark by checking out the Spark Tutorial.
With sufficient experience and expertise, it has become clear that building an ML or AI model is not difficult; building an accurate one is, because training a model requires a large volume of data. One of the most prominent obstacles delaying the advancement of AI/ML models has been the compatibility gap between frameworks for data processing and distributed frameworks for Deep Learning.
Apache Spark splits jobs into many independent tasks, whereas many Deep Learning frameworks use quite different execution logic. In light of this, the Apache Spark community started a new initiative, Project Hydrogen, which aims to unify Big Data processing and ML model training. Project Hydrogen has three principal sub-sections:
- Enhanced Data Exchange
- Barrier Execution Mode
- Accelerator-aware Scheduling
Thanks to an improved scheduler, the cluster manager in this new version of Spark has become accelerator-aware. As you may know, Deep Learning frameworks rely on GPUs (accelerators) to speed up their workloads. Spark can now detect free GPUs and assign tasks to them appropriately.
This wasn’t available in earlier versions: Spark was not aware of the GPUs in the cluster, so users prepared and processed their data in Spark but then resorted to other solutions to train their models. Of the three sub-sections, Barrier Execution Mode has been available since version 2.4, while development work on the other two was still ongoing.
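Accelerator-aware scheduling is likewise driven by configuration. A minimal sketch of the Spark 3.0 `spark-defaults.conf` entries involved (the discovery-script path is a placeholder, not a real file shipped with Spark):

```
spark.executor.resource.gpu.amount           1
spark.task.resource.gpu.amount               1
spark.executor.resource.gpu.discoveryScript  /path/to/getGpus.sh
```

The first two keys declare how many GPUs each executor provides and each task requires; the discovery script tells Spark how to find the GPU addresses on a worker.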
Spark has since released a newer version, Apache Spark 3.0.1, which fixed many stability issues in version 3.0. Be careful when you start using a brand-new version, because it may still contain bugs or have performance and stability issues.