pyspark.sql.DataFrame.repartition

DataFrame.repartition(numPartitions: Union[int, ColumnOrName], *cols: ColumnOrName) → DataFrame

Returns a new DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is hash partitioned.

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
numPartitions : int or Column

an int specifying the target number of partitions, or a Column. If it is a Column, it is used as the first partitioning column. If not specified, the default number of partitions is used.

cols : str or Column

partitioning columns.

Changed in version 1.6.0: Added optional arguments to specify the partitioning columns. Also made numPartitions optional if partitioning columns are specified.

Returns
DataFrame

Repartitioned DataFrame.

Examples

>>> df = spark.createDataFrame(
...     [(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])

Repartition the data into 10 partitions.

>>> df.repartition(10).rdd.getNumPartitions()
10

Repartition the data into 7 partitions by ‘age’ column.

>>> df.repartition(7, "age").rdd.getNumPartitions()
7

Repartition the data into 3 partitions by ‘name’ and ‘age’ columns.

>>> df.repartition(3, "name", "age").rdd.getNumPartitions()
3
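The examples above show only the resulting partition counts. To illustrate what "hash partitioned" means, the sketch below assigns each row to a partition by hashing its partitioning key modulo the number of partitions. This is a minimal illustration in plain Python: Spark itself uses a Murmur3-based hash of the column values, not Python's built-in hash(), so the actual partition assignments will differ.

```python
# Illustrative sketch of hash partitioning (NOT Spark's real algorithm:
# Spark hashes column values with Murmur3; Python's hash() stands in here
# only to show the idea of hash(key) % numPartitions).
rows = [(14, "Tom"), (23, "Alice"), (16, "Bob")]
num_partitions = 7

def partition_id(key, n):
    # Map a partitioning key to a partition index in [0, n).
    return hash(key) % n

# Group rows by the partition chosen for their "age" column.
partitions = {}
for age, name in rows:
    pid = partition_id(age, num_partitions)
    partitions.setdefault(pid, []).append((age, name))
```

The key property this demonstrates is that rows with equal partitioning-column values always land in the same partition, which is what makes hash partitioning useful before joins and aggregations on those columns.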