[100 points] CDH5 RegionServer hangs; canary / GC time lasts more than ten hours

曹宇 2015-04-20 10:57:21
Hi all, I have a thorny problem and hope you can help me solve it.

Cluster: CDH 5.0.5, managed with Cloudera Manager. HBase 0.96.1.1-cdh5.0.5, Hadoop 2.3.0-cdh5.0.5. OS: CentOS 6.6, kernel 2.6.32-504.el6.x86_64

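During normal use a RegionServer suddenly hangs. The CM UI reports, under "Web Server Status" and "Cluster Connection": "This RegionServer is currently not connected to its cluster."
Checking the log shows nothing at all: logging simply stops at the moment the hang begins.
The process is still alive. wget localhost:60030/jmx cannot connect, although telnet to port 60030 succeeds.
Most importantly, CPU usage reaches 999%. In top, the HBase process's CPU usage is off the scale.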


After a long wait (an hour? a day?), the RegionServer recovers from the hang and logs lines such as:
2015-04-19 19:18:12,470 WARN org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 159900480ms
GC pool 'ParNew' had collection(s): count=1 time=159900480ms
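
To put the millisecond figure in that log line into perspective, here is a minimal Python sketch that extracts the pause and converts it to hours. The regular expression and the helper name `pause_hours` are illustrative assumptions and not part of HBase; the log line itself is copied from the post.

```python
import re

# Log line copied verbatim from the post above.
LINE = ("2015-04-19 19:18:12,470 WARN org.apache.hadoop.hbase.util.JvmPauseMonitor: "
        "Detected pause in JVM or host machine (eg GC): "
        "pause of approximately 159900480ms")

def pause_hours(line):
    """Return the pause reported by JvmPauseMonitor in hours, or None if absent."""
    m = re.search(r"pause of approximately (\d+)ms", line)
    if m is None:
        return None
    return int(m.group(1)) / 1000 / 3600  # ms -> s -> h

print(f"{pause_hours(LINE):.1f} hours")  # prints "44.4 hours"
```

So the reported pause is on the order of 44 hours, well beyond a typical ParNew collection, which normally finishes in milliseconds.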


The stdout file that CDH redirects output to contains the following GC log:
65218.270: [GC 65218.270: [ParNewSat Apr 18 13:21:17 CST 2015 RS pid:87434  Canary exited with error code 3
Sat Apr 18 13:21:17 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 13:21:34 CST 2015 RS pid:87434 Canary exited with error code 3
Sat Apr 18 13:21:34 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 13:21:50 CST 2015 RS pid:87434 Canary exited with error code 3
Sat Apr 18 13:21:50 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 13:22:07 CST 2015 RS pid:87434 Canary exited with error code 3
Sat Apr 18 13:22:07 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 13:22:23 CST 2015 RS pid:87434 Canary exited with error code 3
Sat Apr 18 13:22:23 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 14:24:33 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 17:07:02 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 17:07:19 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 18:48:48 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 18:49:05 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 19:24:38 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 20:15:30 CST 2015 RS pid:87434 Canary exited with error code 3
Sat Apr 18 20:15:30 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 21:36:43 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 02:22:03 CST 2015 RS pid:87434 Canary exited with error code 3
Sun Apr 19 02:22:03 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 02:42:28 CST 2015 RS pid:87434 Canary exited with error code 3
Sun Apr 19 02:42:28 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 03:13:08 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 03:43:54 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 07:54:39 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 07:54:55 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 08:00:14 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 08:15:42 CST 2015 RS pid:87434 Canary exited with error code 3
Sun Apr 19 08:15:42 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 09:06:52 CST 2015 RS pid:87434 Canary exited with error code 3
Sun Apr 19 09:06:52 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 11:24:48 CST 2015 RS pid:87434 Canary exited with error code 3
Sun Apr 19 11:24:48 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 15:18:35 CST 2015 RS pid:87434 Starting the canary
: 487294K->12030K(536768K), 99812.7450040 secs] 11812728K->11337939K(16717632K), 99812.7451820 secs] [Times: user=1668901.81 sys=5079.83, real=99797.57 secs]


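My first guess was that this is related to the canary.
However, after reading the log carefully, it looks like the canary's output was simply inserted into the middle of the GC log record. Without the interleaving it would look like this: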
65218.270: [GC 65218.270: [ParNew: 487294K->12030K(536768K), 99812.7450040 secs] 11812728K->11337939K(16717632K), 99812.7451820 secs] [Times: user=1668901.81 sys=5079.83, real=99797.57 secs]
Sat Apr 18 13:21:17 CST 2015 RS pid:87434 Canary exited with error code 3
Sat Apr 18 13:21:17 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 13:21:34 CST 2015 RS pid:87434 Canary exited with error code 3
Sat Apr 18 13:21:34 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 13:21:50 CST 2015 RS pid:87434 Canary exited with error code 3
Sat Apr 18 13:21:50 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 13:22:07 CST 2015 RS pid:87434 Canary exited with error code 3
Sat Apr 18 13:22:07 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 13:22:23 CST 2015 RS pid:87434 Canary exited with error code 3
Sat Apr 18 13:22:23 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 14:24:33 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 17:07:02 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 17:07:19 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 18:48:48 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 18:49:05 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 19:24:38 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 20:15:30 CST 2015 RS pid:87434 Canary exited with error code 3
Sat Apr 18 20:15:30 CST 2015 RS pid:87434 Starting the canary
Sat Apr 18 21:36:43 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 02:22:03 CST 2015 RS pid:87434 Canary exited with error code 3
Sun Apr 19 02:22:03 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 02:42:28 CST 2015 RS pid:87434 Canary exited with error code 3
Sun Apr 19 02:42:28 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 03:13:08 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 03:43:54 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 07:54:39 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 07:54:55 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 08:00:14 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 08:15:42 CST 2015 RS pid:87434 Canary exited with error code 3
Sun Apr 19 08:15:42 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 09:06:52 CST 2015 RS pid:87434 Canary exited with error code 3
Sun Apr 19 09:06:52 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 11:24:48 CST 2015 RS pid:87434 Canary exited with error code 3
Sun Apr 19 11:24:48 CST 2015 RS pid:87434 Starting the canary
Sun Apr 19 15:18:35 CST 2015 RS pid:87434 Starting the canary
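
The manual reordering above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `CANARY_START` pattern is modelled on the canary lines shown in this post, `deinterleave` is a hypothetical helper, and the sample keeps only three of the canary lines to stay short.

```python
import re

# Start of a canary line as it appears in this post, e.g.
# "Sat Apr 18 13:21:17 CST 2015 RS pid:87434 ..." (pattern is an assumption).
CANARY_START = re.compile(
    r"(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun) \w{3} +\d{1,2} "
    r"\d{2}:\d{2}:\d{2} \w+ \d{4} RS pid:\d+"
)

def deinterleave(lines):
    """Split mixed stdout into (joined GC record, list of canary lines)."""
    gc_fragments, canary_lines = [], []
    for line in lines:
        m = CANARY_START.search(line)
        if m is None:
            gc_fragments.append(line)          # pure GC text
        else:
            if m.start() > 0:                  # GC text cut off by a canary line
                gc_fragments.append(line[:m.start()])
            canary_lines.append(line[m.start():])
    return "".join(gc_fragments), canary_lines

sample = [
    "65218.270: [GC 65218.270: [ParNewSat Apr 18 13:21:17 CST 2015 RS pid:87434  Canary exited with error code 3",
    "Sat Apr 18 13:21:17 CST 2015 RS pid:87434 Starting the canary",
    "Sun Apr 19 15:18:35 CST 2015 RS pid:87434 Starting the canary",
    ": 487294K->12030K(536768K), 99812.7450040 secs] 11812728K->11337939K(16717632K), 99812.7451820 secs] [Times: user=1668901.81 sys=5079.83, real=99797.57 secs]",
]

gc_record, canary = deinterleave(sample)
user_secs = float(re.search(r"user=([\d.]+)", gc_record).group(1))
real_secs = float(re.search(r"real=([\d.]+) secs", gc_record).group(1))

print(gc_record)
print(f"{len(canary)} canary lines")
print(f"GC wall-clock time: {real_secs / 3600:.1f} hours")
print(f"user/real CPU ratio: {user_secs / real_secs:.1f}")
```

On this sample the ParNew record is reassembled into the single line shown above, with a wall-clock time of about 27.7 hours. The user/real ratio of roughly 16.7 suggests many GC threads were spinning in user mode for the whole period, which would fit the saturated CPU seen in top, though that reading is an inference rather than something the log states.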



Any advice from the experts would be much appreciated. Thank you very much.
7 replies
dengkanghua1989 2016-05-25
TL;DR: make sure you update your Linux kernels in the near future, or you'll experience some nasty deadlocks.

The impact of this kernel bug is very simple: user processes can deadlock and hang in seemingly impossible situations. A futex wait call (and anything using a futex wait) can stay blocked forever, even though it had been properly woken up by someone. Thread.park() in Java may stay parked. Etc. If you are lucky you may also find soft lockup messages in your dmesg logs.

Linux futex_wait() bug...

Anything running RHEL 6.x or CentOS 6.x is advised to upgrade to the latest kernel (2.6.32-504.16.2 or higher). The post mentions it happens mostly on systems with Intel's Haswell processors (Xeon E3 v3, Xeon E5 v3, etc).

If you haven't been bitten by this bug, it's probably just a matter of time. Or perhaps you've experienced a service that crashed, couldn't figure out the actual reason and left it at a "meh, I'll just restart it and it'll be fine", just to be done with it.

The changelog for the 2.6.32-504.16.2 kernel on CentOS 6.6 mentions this futex fix.

$ yum install yum-changelog python-dateutil
$ yum changelog all kernel-2.6.32-504.16.2.el6 | grep futex
...
- [kernel] futex: Ensure get_futex_key_refs() always implies a barrier (Larry Woodman) [1192107 1167405]
...

It's a long shot, but this kernel bug may be the actual reason.
http://www.tuicool.com/m/articles/FzEV3af
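
As a rough way to check a host against the kernel build named in this reply, the following Python sketch compares a `uname -r` style release string with 2.6.32-504.16.2. The helpers `release_key` and `at_or_after_fix` are hypothetical names, and the parsing assumes the RHEL/CentOS `version-build.elN.arch` layout seen in this thread. It only says whether a release is at or past that build; it does not decide whether an older kernel series is affected at all.

```python
# Build that the quoted post names as carrying the futex fix on CentOS 6.6.
FIXED = "2.6.32-504.16.2"

def release_key(release):
    """Turn '2.6.32-504.16.2.el6.x86_64' into (2, 6, 32, 504, 16, 2)."""
    version, _, build = release.partition("-")
    parts = version.split(".")
    for piece in build.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric piece such as 'el6'
        parts.append(piece)
    return tuple(int(p) for p in parts)

def at_or_after_fix(release):
    """True if the release is at or past the build named in FIXED."""
    return release_key(release) >= release_key(FIXED)

# Kernel from the original post, then the fixed build itself.
print(at_or_after_fix("2.6.32-504.el6.x86_64"))       # False
print(at_or_after_fix("2.6.32-504.16.2.el6.x86_64"))  # True
```

On a live host the release string could be taken from `platform.release()` in the standard library instead of being typed in by hand.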
曹宇 2015-05-13
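Thanks, everyone. The problem is finally solved. The operating system was CentOS 6.6, and we inferred that this is an OS bug: after switching the OS to 6.5 the problem disappeared. However, we still have not found which part of 6.6 is incompatible with Java, GC, or HBase.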
skyWalker_ONLY 2015-04-29
When this happens, I suggest using jconsole or jvisualvm to monitor memory usage and to check whether a GC is in progress.
UESTC少尉 2015-04-24
I haven't run into this; just dropping in to take a look.
乙鱼 2015-04-24
Could HBase be doing file merging (compaction) at that time?
wolf_6 2015-04-22
The previous screenshot was the output of top -Hp pid
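
A related note for anyone repeating this: top -Hp lists thread IDs in decimal, while jstack labels each thread with the same ID in hexadecimal as nid=0x.... A minimal Python sketch of the conversion follows; `jstack_nid` is a hypothetical helper, and 87434 is simply the RegionServer pid from the log above used as a sample number.

```python
def jstack_nid(tid):
    """Format a decimal Linux thread ID the way jstack prints it (nid=0x...)."""
    return f"nid={tid:#x}"

# Sample number only: the RegionServer pid that appears in the canary lines.
print(jstack_nid(87434))  # prints "nid=0x1558a"
```

Searching a jstack dump for that string shows which Java thread, for example a GC worker, corresponds to a busy row in top.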
wolf_6 2015-04-22
Adding one more screenshot.