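The reduceByKey operator groups the elements of a key-value (K-V) RDD by key K, calls a function to aggregate the values V within each group, and produces a new K-V RDD. The number of elements in the new RDD equals the number of groups (distinct keys) in the source RDD. reduceByKey is defined as follows: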
- def reduceByKey(
- self: "RDD[Tuple[K, V]]",
- func: Callable[[V, V], V],
- numPartitions: Optional[int] = None,
- partitionFunc: Callable[[K], int] = portable_hash,
- ) -> "RDD[Tuple[K, V]]"
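The last two parameters of the signature control how keys are spread across the output partitions. As a rough sketch only, assuming the partition index is computed as `partitionFunc(key) % numPartitions`, the pure-Python snippet below illustrates the idea without needing Spark. The names `toy_hash` and `assign_partitions` are hypothetical and exist only for this illustration; `toy_hash` is a simple deterministic stand-in for `portable_hash`, not its real implementation.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def toy_hash(key: str) -> int:
    """Deterministic stand-in for portable_hash (illustrative assumption)."""
    return sum(ord(ch) for ch in key)


def assign_partitions(
    pairs: Iterable[Tuple[str, int]],
    num_partitions: int,
    partition_func: Callable[[str], int] = toy_hash,
) -> Dict[int, List[Tuple[str, int]]]:
    """Place each (key, value) pair into the bucket chosen by its key."""
    buckets: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
    for key, value in pairs:
        # Assumed rule: the bucket index depends on the key only.
        buckets[partition_func(key) % num_partitions].append((key, value))
    return dict(buckets)


pairs = [("Hello", 1), ("Spark", 1), ("Hello", 1), ("You", 1)]
partitions = assign_partitions(pairs, num_partitions=2)
```

Because the bucket index depends only on the key, all pairs that share a key land in the same partition, so the aggregation for that key can be completed inside a single partition.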
Example:
- # Assumes sc is an existing SparkContext
- rdd1 = sc.parallelize(["Hello Python", "Hello Spark You", "Hello Python Spark", "You know PySpark"])
- # Build a K-V RDD of (word, 1) pairs
- rdd2 = rdd1.flatMap(lambda x: x.split(" ")).map(lambda x: (x, 1))
- rdd3 = rdd2.reduceByKey(lambda a, b: a + b)
-
- print("Source K-V RDD:", rdd2.collect())
- print("New K-V RDD:", rdd3.collect())
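To see what the example computes without a Spark cluster, here is a minimal pure-Python sketch of the same group-then-aggregate semantics. It is not how Spark implements reduceByKey (there is no partitioning or shuffle here); `reduce_by_key_local` is a hypothetical helper written only for this illustration.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def reduce_by_key_local(
    pairs: Iterable[Tuple[K, V]], func: Callable[[V, V], V]
) -> List[Tuple[K, V]]:
    """Merge the values of each key with func, one pair at a time."""
    merged: Dict[K, V] = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = func(merged[key], value)
        else:
            # First value seen for this key becomes the initial result.
            merged[key] = value
    return list(merged.items())


lines = ["Hello Python", "Hello Spark You", "Hello Python Spark", "You know PySpark"]
# Same transformation as flatMap + map in the example: one (word, 1) per word.
pairs = [(word, 1) for line in lines for word in line.split(" ")]
counts = reduce_by_key_local(pairs, lambda a, b: a + b)

print(len(pairs), len(counts))  # 11 6
print(sorted(counts))
# [('Hello', 3), ('PySpark', 1), ('Python', 2), ('Spark', 2), ('You', 2), ('know', 1)]
```

The source has 11 pairs and the result has 6, one per distinct word, which matches the statement that the new RDD has as many elements as the source RDD has groups. In Spark, the order of the elements returned by `rdd3.collect()` may differ from the sorted order shown here.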