pyspark.RDD.groupBy
- RDD.groupBy(f, numPartitions=None, partitionFunc=<function portable_hash>)
Return an RDD of grouped items.
New in version 0.7.0.
- Parameters
- f : function
a function to compute the key
- numPartitions : int, optional
the number of partitions in the new RDD
- partitionFunc : function, optional, default portable_hash
a function to compute the partition index
- Returns
- RDD
a new RDD of grouped items
Examples
>>> rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
>>> result = rdd.groupBy(lambda x: x % 2).collect()
>>> sorted([(x, sorted(y)) for (x, y) in result])
[(0, [2, 8]), (1, [1, 1, 3, 5])]
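A further sketch (not part of the original examples) showing the optional parameters in use; the word list and the ord-based partitioner are illustrative only, and sc is assumed to exist as above:

>>> rdd = sc.parallelize(["apple", "avocado", "banana", "blueberry"])
>>> # group by first letter, with an explicit partition count and key hash
>>> grouped = rdd.groupBy(lambda w: w[0], numPartitions=2, partitionFunc=lambda k: ord(k))
>>> sorted([(k, sorted(v)) for (k, v) in grouped.collect()])
[('a', ['apple', 'avocado']), ('b', ['banana', 'blueberry'])]

Each grouped value is an iterable rather than a list, so it is materialized here via sorted before comparison. The partition index is taken modulo numPartitions, so any deterministic int-valued function of the key works as partitionFunc.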