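The partitionBy operator applies a function to the key of every element in a key-value RDD and repartitions the RDD according to the function's return value, producing a new key-value RDD. partitionBy is defined as follows: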
def partitionBy(
    self: "RDD[Tuple[K, V]]",
    numPartitions: Optional[int],
    partitionFunc: Callable[[K], int] = portable_hash,
) -> "RDD[Tuple[K, V]]"
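The return value of partitionFunc does not have to fall inside the range 0 to numPartitions - 1: PySpark reduces it modulo numPartitions to pick the target partition. The snippet below is a minimal sketch of that rule in plain Python, with no Spark involved. The helper name `partition_index` is hypothetical, and the built-in `hash` is only a stand-in for `portable_hash`.

```python
from typing import Callable, Hashable


def partition_index(
    key: Hashable,
    num_partitions: int,
    partition_func: Callable[[Hashable], int] = hash,  # stand-in for portable_hash
) -> int:
    """Hypothetical helper: reduce the function's result modulo the partition count."""
    return partition_func(key) % num_partitions


# Results outside 0..num_partitions-1 wrap around instead of raising an error.
wrapped = [partition_index(k, 3, lambda k: k) for k in (0, 1, 5, -1)]
print(wrapped)
```

Because Python's `%` is never negative for a positive divisor, even a negative return value lands in a valid partition.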
Example:
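from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def func(key):
    # Each key is one of the original strings; split once and match whole words.
    words = key.split(" ")
    if "Spark" in words:
        return 0
    if "Python" in words:
        return 1
    # 5 is outside 0..2; PySpark takes it modulo numPartitions, so it means partition 2.
    return 5

rdd1 = sc.parallelize(["Hello Python", "Hello Spark You", "Hello Python Spark", "You know PySpark"])
# zipWithIndex turns each string into a (string, index) pair, so the string becomes the key.
rdd2 = rdd1.zipWithIndex()
print("Partitions of rdd2:", rdd2.glom().collect())
rdd3 = rdd2.partitionBy(3, func)
print("Partitions of rdd3:", rdd3.glom().collect())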
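To see which partition each element of the example should land in without starting Spark, the bucketing can be imitated locally. This is only a sketch: `simulate_partition_by` is a hypothetical stand-in, not part of PySpark, and it assumes the target partition is the function's return value modulo the partition count. The order of elements inside each partition of a real RDD may differ.

```python
def func(key):
    words = key.split(" ")
    if "Spark" in words:
        return 0
    if "Python" in words:
        return 1
    return 5


def simulate_partition_by(pairs, num_partitions, partition_func):
    """Hypothetical local stand-in for partitionBy: bucket (key, value) pairs."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[partition_func(key) % num_partitions].append((key, value))
    return buckets


lines = ["Hello Python", "Hello Spark You", "Hello Python Spark", "You know PySpark"]
pairs = list(zip(lines, range(len(lines))))  # mirrors what zipWithIndex() produces
buckets = simulate_partition_by(pairs, 3, func)
for index, bucket in enumerate(buckets):
    print(index, bucket)
```

Note that "You know PySpark" matches neither branch, because "PySpark" is a different word from "Spark" after splitting, so func returns 5 and the element goes to partition 5 % 3 = 2.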