Dataset repartition in Apache Spark

As we know, Apache Spark is one of the fastest big data computation frameworks, and it performs best when data is distributed evenly across nodes and executors. However, we cannot guarantee that the partitions produced at intermediate stages of an application are evenly distributed, and uneven partitions hurt job performance. To tackle such issues, Spark provides a few methods that redistribute data across partitions and help the job run faster.

First, let's take a look at the different methods and overloads that perform a repartition:

def repartition(numPartitions: Int): Dataset[T]

This returns a new Dataset that has exactly numPartitions partitions.
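
For example, a minimal sketch (the sample numbers and the partition count of 8 are illustrative assumptions, not values from any real job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RepartitionExample").master("local[*]").getOrCreate()
import spark.implicits._

val ds = (1 to 1000).toDS()                  // small sample Dataset[Int]
val repartitioned = ds.repartition(8)        // full shuffle into exactly 8 partitions
println(repartitioned.rdd.getNumPartitions)  // prints 8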

def repartition(partitionExprs: Column*): Dataset[T]

This returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as the number of partitions. The resulting Dataset is hash partitioned.
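
A minimal sketch (the Sale case class and its sample rows are made up for illustration; it reuses the spark session and implicits from the sketch above):

case class Sale(country: String, amount: Double)  // hypothetical schema
val sales = Seq(Sale("US", 10.0), Sale("IN", 20.0), Sale("US", 5.0)).toDS()
// Hash partition by country; the partition count comes from spark.sql.shuffle.partitions (200 by default).
val byCountry = sales.repartition($"country")     // rows with equal country values land in the same partition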

def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]

Returns a new Dataset partitioned by the given partitioning expressions into exactly numPartitions partitions. The resulting Dataset is hash partitioned.
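
Continuing the same illustrative example, this overload fixes the partition count explicitly:

val byCountry4 = sales.repartition(4, $"country")  // hash partition by country into exactly 4 partitions
println(byCountry4.rdd.getNumPartitions)           // prints 4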

All of the above methods work efficiently if we want to increase the number of partitions or just shuffle the data while keeping the same number of partitions.

If we instead want to reduce the number of partitions, coalesce() is the best way to do it.

def coalesce(numPartitions: Int): Dataset[T]

Returns a new Dataset that has exactly numPartitions partitions, when fewer partitions are requested than the input Dataset has. (If a larger number is requested, the Dataset keeps its current number of partitions.)
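
A minimal sketch (reusing the spark session from above; the partition counts are illustrative):

val many = spark.range(0, 1000, 1, 100)  // Dataset with 100 partitions
val fewer = many.coalesce(10)            // merges existing partitions; no full shuffle
println(fewer.rdd.getNumPartitions)      // prints 10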

Dataset repartition() vs coalesce()

  • We can increase or decrease the number of in-memory partitions using the repartition() method, but we can only decrease the number of in-memory partitions using coalesce(). (Decreasing partitions with repartition() works, but it is inefficient because it still performs a full shuffle.)

  • repartition() performs a full shuffle of data between the executors and may create new partitions. coalesce() does not fully shuffle the data; it merges existing partitions, so it does not create new ones.

  • The data gets nearly evenly partitioned after repartition(), but because coalesce() avoids a full shuffle, its partition sizes can vary widely.

  • Since a full shuffle takes place, repartition() is less performant than coalesce() (compare the physical plans in the sketch below).
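
One way to see the difference is to compare the physical plans: repartition() adds a shuffle (Exchange) step, while coalesce() does not. A quick check, reusing ds from the first sketch:

ds.repartition(2).explain()  // plan contains an Exchange (shuffle) node
ds.coalesce(2).explain()     // plan shows Coalesce, with no Exchange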

There is another Dataset API method, repartitionByRange(), which does almost the same thing as repartition() but partitions the data based on ranges of column values rather than a hash of them.
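
A minimal sketch, reusing the illustrative sales Dataset from above:

// Range partition by amount into 4 partitions; rows fall into contiguous value ranges.
val byRange = sales.repartitionByRange(4, $"amount")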

In summary, by repartitioning data in Apache Spark we reduce data skew and help the job achieve optimum performance.