Spark job hangs while running, asking for help

zhouwenzhen1437 2017-03-26 10:31:17
My cluster has 7 machines: 5 of them have 8 GB of RAM each, and the other 2 are virtual machines.
After writing the program I submitted it with spark-submit and it used to run to completion. But when I re-ran it today, it hit a problem. The log looks like this:
17/03/26 10:10:32 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 196.168.168.104:59612 (size: 119.0 B, free: 3.0 GB)
17/03/26 10:10:32 INFO BlockManagerInfo: Added broadcast_1730_piece0 in memory on 196.168.168.104:59612 (size: 119.0 B, free: 3.0 GB)
17/03/26 10:10:32 INFO BlockManagerInfo: Added broadcast_1732_piece0 in memory on 196.168.168.104:59612 (size: 119.0 B, free: 3.0 GB)
17/03/26 10:10:32 INFO BlockManagerInfo: Added broadcast_1733_piece0 in memory on 196.168.168.104:59612 (size: 119.0 B, free: 3.0 GB)
It just stays stuck at this point. I have tried many things without success and I cannot figure out the cause, so I would appreciate it if someone could take a look and offer some advice. Thanks.
peter_pes 2018-07-23
https://blog.csdn.net/lingbo229/article/details/80914283
Luis_yao 2017-08-15
Quoting reply #15 from javahuoshan:
--num-executors 100 \ --driver-memory 6g \ --executor-memory 6g \ --executor-cores 8 \ 100 executors, each with 6 GB of executor memory and 8 CPU cores: how much memory and how many CPUs is that going to take?
That works out to 600 GB of memory and 800 cores; the cluster's resources are nowhere near enough.
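For a rough comparison, going by the cluster described in the post further down (4 workers with about 6 GB usable each plus 2 virtual machines with about 1 GB each, so roughly 26 GB in total), a request that actually fits the hardware might look more like the sketch below. The numbers are purely illustrative and untested on this cluster, and the two 1 GB VMs still cannot host a multi-gigabyte executor, so they would simply get no executors.

# Illustrative sizing only: roughly one modest executor per 8 GB worker.
/root/spark-2.1.0-bin-hadoop2.6/bin/spark-submit \
  --class com.sirc.zwz.CSRJava.ChangeDataStruction.SCSR \
  --num-executors 4 \
  --driver-memory 2g \
  --executor-memory 5g \
  --executor-cores 2 \
  /root/jars/SparkCSR_JAVA-0.0.1-SNAPSHOT.jar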
火山1 2017-08-09
--num-executors 100 \ --driver-memory 6g \ --executor-memory 6g \ --executor-cores 8 \ 100 executors, each with 6 GB of executor memory and 8 CPU cores: how much memory and how many CPUs is that going to take?
tchqiq 2017-04-13
Why not post the screenshots here on CSDN instead of making people switch back and forth between two sites... I looked at the UI screenshots; it does not look shuffle-related and there is no data skew, so isn't it simply that the data volume is large and the resources are insufficient? You need to work out which stage it is actually stuck on before anyone can analyze which part of the code is inefficient.
zhouwenzhen1437 2017-04-08
I also asked this question on Zhihu and posted the source code there; please take a look: https://www.zhihu.com/question/57772280?guide=1
tchqiq 2017-04-05
The parameters you set improve shuffle stability, which is why the job now runs to completion. If you want shuffle to use more executor memory, adjust these two: num-executors (currently 100): make this smaller; spark.shuffle.memoryFraction: make this larger. I don't know exactly where it is slow, so I can't give you more specific tuning advice. Are you using hash shuffle? consolidateFiles only matters for hash shuffle; maybe switch to sort shuffle and try again. When a job is slow, it is usually slow in the shuffle.
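For what it's worth, on Spark 2.1 the hash shuffle manager has already been removed, so sort-based shuffle is the only implementation and, as far as I can tell, spark.shuffle.consolidateFiles no longer has any effect; the actionable part here is the two knobs above. Since the job's SparkConf does not set either of them in code, they can be tried from the command line without recompiling. The values below are only illustrative, and note that on Spark 2.x spark.shuffle.memoryFraction is only honored if the old memory manager is explicitly switched back on (the unified-memory alternative is sketched under the 2017-04-03 reply below):

# Illustrative only: far fewer executors, plus the legacy shuffle fraction,
# which on Spark 2.x requires spark.memory.useLegacyMode=true to take effect.
/root/spark-2.1.0-bin-hadoop2.6/bin/spark-submit \
  --class com.sirc.zwz.CSRJava.ChangeDataStruction.SCSR \
  --num-executors 6 \
  --executor-memory 6g \
  --conf spark.memory.useLegacyMode=true \
  --conf spark.shuffle.memoryFraction=0.4 \
  /root/jars/SparkCSR_JAVA-0.0.1-SNAPSHOT.jar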
zhouwenzhen1437 2017-04-05
SparkConf sc = new SparkConf().setAppName("SparkCalculateSR")
        .set("spark.storage.memoryFraction", "0.2")
        .set("spark.default.parallelism", "20")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.shuffle.consolidateFiles", "true")
        .set("spark.reducer.maxSizeInFlight", "100m")
        .set("spark.shuffle.file.buffer", "100k")
        .set("spark.shuffle.io.maxRetries", "10")
        .set("spark.shuffle.io.retryWait", "10s");
These are the parameters I set. After adding memory the job can finish, but it is extremely, unbearably slow. Please give me some more pointers.
tchqiq 2017-04-03
1. The more executors that share one machine's memory, the less memory each executor gets, which leads to excessive spilling and even out-of-memory errors.
2. Try increasing this parameter: spark.shuffle.memoryFraction
* What it does: the fraction of executor memory that a task can use for aggregation after it has pulled the previous stage's output during a shuffle. The default is 0.2, i.e. by default only 20% of executor memory is available for that aggregation. If the aggregation needs more than this 20% limit, the excess data is spilled to disk files, which hurts performance badly.
* Tuning suggestion: if the job persists few RDDs but shuffles a lot, lower the memory share reserved for persistence and raise the share for shuffle, so that shuffle data does not overflow memory and force spills to disk. Also, if the job runs slowly because of frequent GC, the memory left for task user code is insufficient, and in that case it is likewise advisable to lower this value.
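One caveat that matters for Spark 2.1 specifically (worth double-checking against the docs for the exact version): spark.shuffle.memoryFraction and spark.storage.memoryFraction belong to the old static memory manager and are ignored by the unified memory manager that has been the default since Spark 1.6; there, execution (shuffle aggregation, sorts, joins) and storage share a single pool controlled by spark.memory.fraction. A minimal sketch of that route, with an illustrative value:

# Unified memory manager (the Spark 2.x default): one combined pool for
# execution and storage. 0.7 is only an example; the default is about 0.6.
/root/spark-2.1.0-bin-hadoop2.6/bin/spark-submit \
  --class com.sirc.zwz.CSRJava.ChangeDataStruction.SCSR \
  --executor-memory 6g \
  --conf spark.memory.fraction=0.7 \
  /root/jars/SparkCSR_JAVA-0.0.1-SNAPSHOT.jar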
zhouwenzhen1437 2017-03-28
I tried shutting down the two virtual machines and running it again, but I hit the same problem: it gets stuck on one stage, and that stage is the one executing saveAsTextFile.
meichuntao 2017-03-28
I think it may be caused by your virtual machines. Even though you specify executor-memory as 6 GB, the VMs actually only have 1 GB, so whenever the computation needs a lot of memory, the executors on the VMs can do nothing but spill over and over. You can check the web UI on port 4040 (it may be 404X) to see exactly which task and which executor it is stuck on.
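If flipping between screenshots is awkward, the same information shown on the 4040 UI can also be pulled as JSON from the driver's monitoring REST endpoint. The host below is an assumption (use whichever machine runs the driver), and the application id placeholder has to be replaced with the id returned by the first call:

# Port 4040 by default, or 4041/4042/... if 4040 was already taken.
curl http://196.168.168.100:4040/api/v1/applications
APP_ID=app-20170328112029-0000   # placeholder: use the real id from the call above
curl "http://196.168.168.100:4040/api/v1/applications/$APP_ID/stages?status=active"
curl "http://196.168.168.100:4040/api/v1/applications/$APP_ID/executors"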
zhouwenzhen1437 2017-03-28
I also noticed that when the program first starts running, CPU usage is fairly normal, staying around 1-30%; as soon as it reaches the point where it gets stuck, usage explodes, peaking at 790%. It is very likely a memory problem. Any pointers would be appreciated.
zhouwenzhen1437 2017-03-28
With a small amount of data it finishes quickly, but as soon as the data volume gets large it hangs, and it can stay stuck at that point for hours. Memory should be sufficient, because it has completed successfully several times before.
zhouwenzhen1437 2017-03-28
Spark 2.1. Here is the spark-submit command:
/root/spark-2.1.0-bin-hadoop2.6/bin/spark-submit \
--class com.sirc.zwz.CSRJava.ChangeDataStruction.SCSR \
--num-executors 100 \
--driver-memory 6g \
--executor-memory 6g \
--executor-cores 8 \
/root/jars/SparkCSR_JAVA-0.0.1-SNAPSHOT.jar
The cluster has 7 machines: 1 master and 6 slaves. Four of the slaves have 8 GB of RAM each, of which at most 6 GB per machine is available to Spark; the other two are virtual machines with 2 GB each, of which 1 GB each is available for computation.
Here is part of the log:
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-017dbc57-6553-43ea-8a2d-3555fccd663d:NORMAL:196.168.168.103:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6eb004b2-b3dc-42df-b212-ffa2fd6b5572:NORMAL:196.168.168.27:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5785ace1-a611-479b-b360-79562081feb1:NORMAL:196.168.168.104:50010|RBW]]}
2017-03-28 11:21:23,382 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /SRResult/N25E118/_temporary/0/_temporary/attempt_20170327232029_0002_m_000017_21/part-00017. BP-2089499914-196.168.168.100-1490492430641 blk_1073742807_1983{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-5785ace1-a611-479b-b360-79562081feb1:NORMAL:196.168.168.104:50010|RBW], ReplicaUnderConstruction[[DISK]DS-411c0e4c-86c5-4203-94d8-d6d7a95df7da:NORMAL:196.168.168.102:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6eb004b2-b3dc-42df-b212-ffa2fd6b5572:NORMAL:196.168.168.27:50010|RBW]]}
2017-03-28 11:21:23,459 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /SRResult/N25E118/_temporary/0/_temporary/attempt_20170327232029_0002_m_000029_33/part-00029. BP-2089499914-196.168.168.100-1490492430641 blk_1073742808_1984{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-5785ace1-a611-479b-b360-79562081feb1:NORMAL:196.168.168.104:50010|RBW], ReplicaUnderConstruction[[DISK]DS-411c0e4c-86c5-4203-94d8-d6d7a95df7da:NORMAL:196.168.168.102:50010|RBW], ReplicaUnderConstruction[[DISK]DS-017dbc57-6553-43ea-8a2d-3555fccd663d:NORMAL:196.168.168.103:50010|RBW]]}
2017-03-28 11:21:23,509 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /SRResult/N25E118/_temporary/0/_temporary/attempt_20170327232029_0002_m_000026_30/part-00026. BP-2089499914-196.168.168.100-1490492430641 blk_1073742809_1985{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-97e7de7f-fbcd-44bb-821d-4d245f1ce82c:NORMAL:196.168.168.101:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6eb004b2-b3dc-42df-b212-ffa2fd6b5572:NORMAL:196.168.168.27:50010|RBW], ReplicaUnderConstruction[[DISK]DS-411c0e4c-86c5-4203-94d8-d6d7a95df7da:NORMAL:196.168.168.102:50010|RBW]]}
2017-03-28 11:21:23,513 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /SRResult/N25E118/_temporary/0/_temporary/attempt_20170327232029_0002_m_000009_13/part-00009. BP-2089499914-196.168.168.100-1490492430641 blk_1073742810_1986{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-5785ace1-a611-479b-b360-79562081feb1:NORMAL:196.168.168.104:50010|RBW], ReplicaUnderConstruction[[DISK]DS-017dbc57-6553-43ea-8a2d-3555fccd663d:NORMAL:196.168.168.103:50010|RBW], ReplicaUnderConstruction[[DISK]DS-99ba79bc-da18-4d0d-9a2c-b7b367cbea66:NORMAL:196.168.168.29:50010|RBW]]}
2017-03-28 11:21:23,521 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /SRResult/N25E118/_temporary/0/_temporary/attempt_20170327232029_0002_m_000013_17/part-00013. 
BP-2089499914-196.168.168.100-1490492430641 blk_1073742811_1987{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-5785ace1-a611-479b-b360-79562081feb1:NORMAL:196.168.168.104:50010|RBW], ReplicaUnderConstruction[[DISK]DS-411c0e4c-86c5-4203-94d8-d6d7a95df7da:NORMAL:196.168.168.102:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6eb004b2-b3dc-42df-b212-ffa2fd6b5572:NORMAL:196.168.168.27:50010|RBW]]} 2017-03-28 11:21:24,090 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /SRResult/N25E118/_temporary/0/_temporary/attempt_20170328112029_0002_m_000019_23/part-00019. BP-2089499914-196.168.168.100-1490492430641 blk_1073742812_1988{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-017dbc57-6553-43ea-8a2d-3555fccd663d:NORMAL:196.168.168.103:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5785ace1-a611-479b-b360-79562081feb1:NORMAL:196.168.168.104:50010|RBW], ReplicaUnderConstruction[[DISK]DS-411c0e4c-86c5-4203-94d8-d6d7a95df7da:NORMAL:196.168.168.102:50010|RBW]]} 2017-03-28 11:27:42,734 INFO BlockStateChange: BLOCK* processReport: from storage DS-411c0e4c-86c5-4203-94d8-d6d7a95df7da node DatanodeRegistration(196.168.168.102, datanodeUuid=5407fb12-70a4-48d2-ac27-813a7833434c, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=CID-18972982-c034-4dd4-b10b-d6563325e4cb;nsid=220744474;c=0), blocks: 20, hasStaleStorages: false, processing time: 1 msecs 2017-03-28 12:15:58,137 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 196.168.168.100 2017-03-28 12:15:58,137 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs 2017-03-28 12:15:58,137 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 7487 2017-03-28 12:15:58,137 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 360 Total time for transactions(ms): 33 Number of transactions batched in Syncs: 17 Number of syncs: 147 SyncTimes(ms): 1310 712 2017-03-28 12:15:58,160 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 360 Total time for transactions(ms): 33 Number of transactions batched in Syncs: 17 Number of syncs: 148 SyncTimes(ms): 1328 716 2017-03-28 12:15:58,161 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /root/hadoop/hadoop-2.6.5/name1/current/edits_inprogress_0000000000000007487 -> /root/hadoop/hadoop-2.6.5/name1/current/edits_0000000000000007487-0000000000000007846 2017-03-28 12:15:58,161 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /root/hadoop/hadoop-2.6.5/name2/current/edits_inprogress_0000000000000007487 -> /root/hadoop/hadoop-2.6.5/name2/current/edits_0000000000000007487-0000000000000007846 2017-03-28 12:15:58,161 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 7847 2017-03-28 12:15:58,551 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Transfer took 0.08s at 139.24 KB/s 2017-03-28 12:15:58,551 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000000007846 size 12281 bytes. 
2017-03-28 12:15:58,618 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 7486 2017-03-28 12:15:58,618 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/root/hadoop/hadoop-2.6.5/name1/current/fsimage_0000000000000007447, cpktTxId=0000000000000007447) 2017-03-28 12:15:58,619 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/root/hadoop/hadoop-2.6.5/name2/current/fsimage_0000000000000007447, cpktTxId=0000000000000007447) 2017-03-28 12:48:56,541 INFO BlockStateChange: BLOCK* processReport: from storage DS-99ba79bc-da18-4d0d-9a2c-b7b367cbea66 node DatanodeRegistration(196.168.168.29, datanodeUuid=9efd8c9e-162c-4c45-af71-bf33f49ad408, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=CID-18972982-c034-4dd4-b10b-d6563325e4cb;nsid=220744474;c=0), blocks: 13, hasStaleStorages: false, processing time: 1 msecs 2017-03-28 13:15:58,890 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 196.168.168.100 2017-03-28 13:15:58,890 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs 2017-03-28 13:15:58,890 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 7847 2017-03-28 13:15:58,891 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 70 46 2017-03-28 13:15:58,948 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 0 Number of syncs: 3 SyncTimes(ms): 105 68 2017-03-28 13:15:58,949 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /root/hadoop/hadoop-2.6.5/name1/current/edits_inprogress_0000000000000007847 -> /root/hadoop/hadoop-2.6.5/name1/current/edits_0000000000000007847-0000000000000007848 2017-03-28 13:15:58,950 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /root/hadoop/hadoop-2.6.5/name2/current/edits_inprogress_0000000000000007847 -> /root/hadoop/hadoop-2.6.5/name2/current/edits_0000000000000007847-0000000000000007848 2017-03-28 13:15:58,951 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 7849 2017-03-28 13:15:59,856 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Transfer took 0.20s at 55.28 KB/s 2017-03-28 13:15:59,856 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000000007848 size 12281 bytes. 2017-03-28 13:16:00,041 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 7846 2017-03-28 13:16:00,041 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/root/hadoop/hadoop-2.6.5/name1/current/fsimage_0000000000000007486, cpktTxId=0000000000000007486) 2017-03-28 13:16:00,041 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/root/hadoop/hadoop-2.6.5/name2/current/fsimage_0000000000000007486, cpktTxId=0000000000000007486) 2017-03-28 13:43:11,290 INFO logs: Aliases are enabled
meichuntao 2017-03-28
The OP has not given enough information. To start with: the Spark version, the application code, which task it is stuck on, and the memory configuration? It looks like insufficient memory causing frequent spilling to disk.
zhouwenzhen1437 2017-03-27
Not yet, it is still unsolved. Very puzzling; I need someone to come to the rescue.
小胖胖蒋 2017-03-27
I am running into the same problem and have been stuck at this step for a long time. Has the OP solved it?
zhouwenzhen1437 2017-03-26
Looking through the logs I found:
17/03/25 22:52:32 INFO ExternalSorter: Thread 82 spilling in-memory map of 473.6 MB to disk (25 times so far)
17/03/25 22:52:37 INFO ExternalSorter: Thread 71 spilling in-memory map of 392.0 MB to disk (26 times so far)
17/03/25 22:52:52 INFO ExternalSorter: Thread 80 spilling in-memory map of 392.0 MB to disk (22 times so far)
17/03/25 22:53:07 INFO ExternalSorter: Thread 70 spilling in-memory map of 392.0 MB to disk (24 times so far)
17/03/25 22:53:38 INFO ExternalSorter: Thread 79 spilling in-memory map of 401.9 MB to disk (27 times so far)
17/03/25 22:53:49 INFO ExternalSorter: Thread 83 spilling in-memory map of 416.0 MB to disk (24 times so far)
17/03/25 22:53:53 INFO ExternalSorter: Thread 82 spilling in-memory map of 396.8 MB to disk (26 times so far)
I do not know how to fix this.
