Spark RDD flatMap is slow, asking for help

fan020312 2016-05-03 03:26:25
When using the RDD flatMap function, if the function returns many array elements per record (say tens to hundreds), Spark runs extremely slowly. Sample code:
import scala.collection.mutable.ArrayBuffer

val rdd2 = sc.textFile(strInputFilePath).flatMap(line => {
  // Collects one ((row, col), weight) entry per grid cell this point touches
  val grids = new ArrayBuffer[((Int, Int), Double)]()
  val values = line.split(",")
  if (values.length > 3) {
    try {
      val br = broadcast.value
      val x = values(br._6).toDouble
      val y = values(br._7).toDouble

      val point = new SPoint2D(x, y)

      if (br._1.contains(point)) {
        // Bounding box of the search radius (br._2) around this point
        val xmin = point.x - br._2
        val ymin = point.y - br._2
        val xmax = point.x + br._2
        val ymax = point.y + br._2

        val rcBounds = br._1
        val IndexCols = br._4
        val IndexRows = br._5

        // Map the bounding box to grid indices; br._3 is assumed to be the
        // grid cell size (the original snippet used an undefined `resolution`
        // variable here, while the loop below already divides by br._3)
        var col1 = math.floor((xmin - rcBounds.getLeft()) / br._3).toInt
        var col2 = math.floor((xmax - rcBounds.getLeft()) / br._3).toInt
        var row1 = -math.floor((ymax - rcBounds.getTop()) / br._3).toInt
        var row2 = -math.floor((ymin - rcBounds.getTop()) / br._3).toInt

        // Clamp indices to the grid extent
        col1 = math.min(math.max(col1, 0), IndexCols - 1)
        col2 = math.min(math.max(col2, 0), IndexCols - 1)
        row1 = math.min(math.max(row1, 0), IndexRows - 1)
        row2 = math.min(math.max(row2, 0), IndexRows - 1)

        // Ensure the ranges run low-to-high
        if (col1 > col2) { val t = col1; col1 = col2; col2 = t }
        if (row1 > row2) { val t = row1; row1 = row2; row2 = t }

        for (col <- col1 to col2; row <- row1 to row2) {
          // Distance from the point to the center of this cell
          val xtemp = rcBounds.getLeft() + col * br._3 + 0.5 * br._3
          val ytemp = rcBounds.getTop() - row * br._3 - 0.5 * br._3
          val distance = math.sqrt((x - xtemp) * (x - xtemp) + (y - ytemp) * (y - ytemp))

          if (distance <= br._2) {
            // Kernel weight for this cell, normalized by br._8
            val disPre = distance / br._2
            val valuePre = 3.0 * math.pow(1 - disPre * disPre, 2) / br._8
            grids += (((row, col), valuePre))
          }
        }
      }
    } catch {
      case ex: Exception =>
        ex.printStackTrace()
    }
  }
  grids
})

In the code above, when the grids ArrayBuffer ends up holding many objects, this runs very slowly. Does anyone have a good approach or solution?
2 replies
java8964 2017-03-18
I don't think Spark is slow; your code is slow. You should benchmark your method (the whole line => { your implementation here } closure) to see how long it runs. If it runs for 1 second, and assuming your text file has 1,000,000 lines, that is 1,000,000 seconds of sequential work; even if Spark runs it with 1,000-way concurrency, your whole job will still take 1,000 seconds. So my suggestions are: 1) Benchmark how long your method runs and try to improve it. I didn't read it carefully, but it looks like it can be improved; the code smells bad. 2) If there is no way to improve the method itself, notice that the only values you need from each line are val x = values(br._6).toDouble and val y = values(br._7).toDouble. So you can take the distinct combinations of (values(br._6), values(br._7)) first, so that each unique combination runs through your method only once, instead of once per line as it does now.
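For suggestion 1), a rough driver-side benchmark of the per-line cost might look like the sketch below; processLine is a hypothetical function wrapping the body of the original flatMap, and the sample size is arbitrary.

// Rough driver-side benchmark of the per-line work.
// processLine is a hypothetical wrapper around the original flatMap body.
val sample = sc.textFile(strInputFilePath).take(10000)
val t0 = System.nanoTime()
sample.foreach(processLine)
val avgMs = (System.nanoTime() - t0) / 1e6 / sample.length
println(f"average $avgMs%.3f ms per line")

For suggestion 2), a minimal sketch of the deduplication idea, assuming expandToGrid is a hypothetical helper holding the grid-expansion logic of the original flatMap, and assuming the downstream job sums the weights per grid cell (so each unique point's contribution can be scaled by how often it occurs):

// Parse each line down to its (x, y) string pair and count duplicates,
// then run the expensive grid expansion once per unique point.
val counted = sc.textFile(strInputFilePath)
  .flatMap { line =>
    val br = broadcast.value
    val values = line.split(",")
    if (values.length > 3) Some((values(br._6), values(br._7))) else None
  }
  .map(xy => (xy, 1L))
  .reduceByKey(_ + _)                      // one record per unique (x, y)

val rdd2 = counted.flatMap { case ((xs, ys), count) =>
  // expandToGrid is a hypothetical helper returning Seq[((Int, Int), Double)]
  expandToGrid(xs.toDouble, ys.toDouble)
    .map { case (cell, w) => (cell, w * count) }
}

This way the expensive per-point loop runs once per distinct coordinate pair rather than once per input line, which is a large win whenever the input contains many repeated points.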
zhouwenzhen1437 2017-03-18
Has your problem been solved?
