What causes HBase to crash unexpectedly?

the_gunner 2014-07-03 12:38:13
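On our company's cluster, which is not very highly specced, the HBase RegionServer or HMaster frequently goes down, and we don't know exactly why. My guess is that MapReduce jobs are competing with HBase for system resources: once HBase is up, nothing goes wrong as long as no MapReduce jobs run at the same time. The crashes usually happen while a MapReduce import job is running.

Could an expert confirm the cause?

Below are some log entries I found from around the time of a crash: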


2014-07-03 11:35:00,516 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10979ms
2014-07-03 11:35:16,731 WARN [regionserver60020] util.Sleeper: We slept 16189ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2014-07-03 11:35:16,746 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12933ms
2014-07-03 11:35:29,801 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 6797ms
2014-07-03 11:35:31,768 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1466ms
2014-07-03 11:35:31,768 INFO [regionserver60020-SendThread(u07:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 66866ms for sessionid 0x646f9e22a350000, closing socket connection and attempting reconnect
2014-07-03 11:35:36,592 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 4307ms
2014-07-03 11:35:49,857 WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 20056ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2014-07-03 11:35:49,858 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12078ms
2014-07-03 11:35:52,555 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2029ms
2014-07-03 11:35:58,543 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 5487ms
2014-07-03 11:36:02,560 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2094ms
2014-07-03 11:36:06,415 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3354ms
2014-07-03 11:36:08,135 INFO [regionserver60020-SendThread(u04:2181)] zookeeper.ClientCnxn: Opening socket connection to server u04/192.168.85.131:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
2014-07-03 11:36:08,288 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1220ms
2014-07-03 11:36:11,092 INFO [regionserver60020-SendThread(u04:2181)] zookeeper.ClientCnxn: Socket connection established to u04/192.168.85.131:2181, initiating session
2014-07-03 11:36:12,170 INFO [regionserver60020-SendThread(u04:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x646f9e22a350000, likely server has closed socket, closing socket connection and attempting reconnect
2014-07-03 11:36:15,675 INFO [regionserver60020-SendThread(u02:2181)] zookeeper.ClientCnxn: Opening socket connection to server u02/192.168.85.129:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
2014-07-03 11:36:15,686 INFO [regionserver60020-SendThread(u02:2181)] zookeeper.ClientCnxn: Socket connection established to u02/192.168.85.129:2181, initiating session
2014-07-03 11:36:25,135 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3709ms
2014-07-03 11:36:34,095 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 8459ms
2014-07-03 11:36:34,098 INFO [regionserver60020-SendThread(u02:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 17147ms for sessionid 0x646f9e22a350000, closing socket connection and attempting reconnect
2014-07-03 11:36:34,772 DEBUG [LruStats #0] hfile.LruBlockCache: Total=1.90 MB, free=402.60 MB, max=404.50 MB, blocks=0, accesses=42069, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=0, evicted=0, evictedPerRun=NaN
2014-07-03 11:36:38,213 INFO [regionserver60020-SendThread(u03:2181)] zookeeper.ClientCnxn: Opening socket connection to server u03/192.168.85.130:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
2014-07-03 11:36:38,804 INFO [regionserver60020-SendThread(u03:2181)] zookeeper.ClientCnxn: Socket connection established to u03/192.168.85.130:2181, initiating session
2014-07-03 11:36:51,474 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12168ms
2014-07-03 11:36:53,638 INFO [regionserver60020-SendThread(u03:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x646f9e22a350000, likely server has closed socket, closing socket connection and attempting reconnect
2014-07-03 11:37:08,385 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 4707ms
2014-07-03 11:37:09,333 WARN [regionserver60020] util.Sleeper: We slept 14746ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2014-07-03 11:37:09,963 INFO [regionserver60020-SendThread(u05:2181)] zookeeper.ClientCnxn: Opening socket connection to server u05/192.168.85.132:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
2014-07-03 11:37:11,451 INFO [regionserver60020-SendThread(u05:2181)] zookeeper.ClientCnxn: Socket connection established to u05/192.168.85.132:2181, initiating session
2014-07-03 11:37:16,208 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2610ms
2014-07-03 11:37:20,313 INFO [regionserver60020-SendThread(u05:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x646f9e22a350000 has expired, closing socket connection
2014-07-03 11:37:20,388 FATAL [regionserver60020-EventThread] regionserver.HRegionServer: ABORTING region server u05,60020,1404351691231: regionserver:60020-0x646f9e22a350000-0x646f9e22a350000-0x646f9e22a350000, quorum=u04:2181,u03:2181,u02:2181,u01:2181,u08:2181,u07:2181,u06:2181,u05:2181, baseZNode=/hbase regionserver:60020-0x646f9e22a350000-0x646f9e22a350000-0x646f9e22a350000 received expired from ZooKeeper, aborting
2014-07-03 11:37:29,418 FATAL [regionserver60020-EventThread] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
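
One way to get a feel for how much time the process spent paused is to total the JvmPauseMonitor entries in a log like the one above. The following is a minimal sketch, assuming the RegionServer log is fed on standard input; the function name `summarize_pauses` is made up for illustration and is not part of HBase.

```shell
#!/bin/sh
# Summarize JvmPauseMonitor pauses from an HBase RegionServer log read on stdin.
# Prints the number of pauses, the total paused milliseconds, and the longest pause.
summarize_pauses() {
  grep 'util.JvmPauseMonitor' |
    sed -n 's/.*pause of approximately \([0-9][0-9]*\)ms.*/\1/p' |
    awk '{ n++; total += $1; if ($1 > max) max = $1 }
         END { printf "pauses=%d total_ms=%d max_ms=%d\n", n, total, max }'
}

# Example with three lines copied from the log in this thread.
summarize_pauses <<'EOF'
2014-07-03 11:35:00,516 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10979ms
2014-07-03 11:35:16,746 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12933ms
2014-07-03 11:35:29,801 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 6797ms
EOF
# prints: pauses=3 total_ms=30709 max_ms=12933
```

In the excerpt above, several individual pauses exceed 10 seconds, and the ZooKeeper client reports not hearing from the server for 66866 ms shortly before the session expires and the RegionServer aborts.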
qingyuan18 2014-07-22
Do you have a javacore file or a thread dump?
vah101 2014-07-21
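GC caused the process's connection to ZooKeeper to time out, which led to "received expired from ZooKeeper", and in the end the server died. Edit the configuration file hbase-env.sh and set the heap size and the GC options properly. References: http://wiki.apache.org/hadoop/PerformanceTuning http://www.cnblogs.com/cenyuhai/p/3235101.html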
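
As a rough illustration of the kind of change meant here, a fragment for conf/hbase-env.sh might look like the sketch below. `HBASE_HEAPSIZE`, `HBASE_OPTS` and `HBASE_REGIONSERVER_OPTS` are variables read by the HBase start scripts; the heap value, the CMS thresholds and the GC log path are assumptions chosen for the example, not recommendations taken from this thread, and the JVM flags are Java 7/8 era options.

```shell
# Fragment for conf/hbase-env.sh (all values are illustrative assumptions).

# Maximum heap for the HBase daemons, given in MB on releases of this era.
export HBASE_HEAPSIZE=4096

# Use the CMS collector and start it early, so that old-generation collections
# are less likely to degrade into long stop-the-world full GCs.
export HBASE_OPTS="${HBASE_OPTS:-} -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"

# GC logging for the RegionServer, so that pauses can be matched against the
# JvmPauseMonitor warnings. The log path is a hypothetical example.
export HBASE_REGIONSERVER_OPTS="${HBASE_REGIONSERVER_OPTS:-} -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-regionserver.log"
```

A larger heap only helps if the host really has that much free physical memory. If MapReduce tasks on the same node push the machine into swap, long pauses may continue, since JvmPauseMonitor reports pauses of the "JVM or host machine" and not only GC.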
weitao1234 2014-07-19
The RegionServer or HMaster goes down while the garbage collector is running a GC. You could try increasing the JVM's memory.