What causes HBase to crash unexpectedly?

the_gunner 2014-07-03 12:38:13

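Our company's cluster runs on fairly low-spec machines, and the HBase RegionServers or the HMaster crash frequently, but I am not sure of the exact cause. My guess is that MapReduce jobs are competing with HBase for system resources: once HBase is up, nothing goes wrong as long as no MapReduce jobs run at the same time. The crashes usually happen while a MapReduce import job is running.

Could someone with more experience confirm the cause?

Below are some of the log entries from around the time of a crash: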



2014-07-03 11:35:00,516 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10979ms

2014-07-03 11:35:16,731 WARN  [regionserver60020] util.Sleeper: We slept 16189ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

2014-07-03 11:35:16,746 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12933ms

2014-07-03 11:35:29,801 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 6797ms

2014-07-03 11:35:31,768 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1466ms

2014-07-03 11:35:31,768 INFO  [regionserver60020-SendThread(u07:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 66866ms for sessionid 0x646f9e22a350000, closing socket connection and attempting reconnect

2014-07-03 11:35:36,592 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 4307ms

2014-07-03 11:35:49,857 WARN  [regionserver60020.periodicFlusher] util.Sleeper: We slept 20056ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

2014-07-03 11:35:49,858 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12078ms

2014-07-03 11:35:52,555 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2029ms

2014-07-03 11:35:58,543 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 5487ms

2014-07-03 11:36:02,560 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2094ms

2014-07-03 11:36:06,415 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3354ms

2014-07-03 11:36:08,135 INFO  [regionserver60020-SendThread(u04:2181)] zookeeper.ClientCnxn: Opening socket connection to server u04/192.168.85.131:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)

2014-07-03 11:36:08,288 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1220ms

2014-07-03 11:36:11,092 INFO  [regionserver60020-SendThread(u04:2181)] zookeeper.ClientCnxn: Socket connection established to u04/192.168.85.131:2181, initiating session

2014-07-03 11:36:12,170 INFO  [regionserver60020-SendThread(u04:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x646f9e22a350000, likely server has closed socket, closing socket connection and attempting reconnect

2014-07-03 11:36:15,675 INFO  [regionserver60020-SendThread(u02:2181)] zookeeper.ClientCnxn: Opening socket connection to server u02/192.168.85.129:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)

2014-07-03 11:36:15,686 INFO  [regionserver60020-SendThread(u02:2181)] zookeeper.ClientCnxn: Socket connection established to u02/192.168.85.129:2181, initiating session

2014-07-03 11:36:25,135 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3709ms

2014-07-03 11:36:34,095 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 8459ms

2014-07-03 11:36:34,098 INFO  [regionserver60020-SendThread(u02:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 17147ms for sessionid 0x646f9e22a350000, closing socket connection and attempting reconnect

2014-07-03 11:36:34,772 DEBUG [LruStats #0] hfile.LruBlockCache: Total=1.90 MB, free=402.60 MB, max=404.50 MB, blocks=0, accesses=42069, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=0, evicted=0, evictedPerRun=NaN

2014-07-03 11:36:38,213 INFO  [regionserver60020-SendThread(u03:2181)] zookeeper.ClientCnxn: Opening socket connection to server u03/192.168.85.130:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)

2014-07-03 11:36:38,804 INFO  [regionserver60020-SendThread(u03:2181)] zookeeper.ClientCnxn: Socket connection established to u03/192.168.85.130:2181, initiating session

2014-07-03 11:36:51,474 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12168ms

2014-07-03 11:36:53,638 INFO  [regionserver60020-SendThread(u03:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x646f9e22a350000, likely server has closed socket, closing socket connection and attempting reconnect

2014-07-03 11:37:08,385 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 4707ms

2014-07-03 11:37:09,333 WARN  [regionserver60020] util.Sleeper: We slept 14746ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

2014-07-03 11:37:09,963 INFO  [regionserver60020-SendThread(u05:2181)] zookeeper.ClientCnxn: Opening socket connection to server u05/192.168.85.132:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)

2014-07-03 11:37:11,451 INFO  [regionserver60020-SendThread(u05:2181)] zookeeper.ClientCnxn: Socket connection established to u05/192.168.85.132:2181, initiating session

2014-07-03 11:37:16,208 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2610ms

2014-07-03 11:37:20,313 INFO  [regionserver60020-SendThread(u05:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x646f9e22a350000 has expired, closing socket connection

2014-07-03 11:37:20,388 FATAL [regionserver60020-EventThread] regionserver.HRegionServer: ABORTING region server u05,60020,1404351691231: regionserver:60020-0x646f9e22a350000-0x646f9e22a350000-0x646f9e22a350000, quorum=u04:2181,u03:2181,u02:2181,u01:2181,u08:2181,u07:2181,u06:2181,u05:2181, baseZNode=/hbase regionserver:60020-0x646f9e22a350000-0x646f9e22a350000-0x646f9e22a350000 received expired from ZooKeeper, aborting

2014-07-03 11:37:29,418 FATAL [regionserver60020-EventThread] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
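
To see how much of this window the process spent paused, the JvmPauseMonitor lines in a log like the one above can be tallied with standard tools. This is only a rough sketch: the function names `summarize_pauses` and `zk_session_events` and the log path in the usage comment are made up for illustration.

```shell
# Hypothetical helpers for reading a RegionServer log like the excerpt above.

# Print how many JvmPauseMonitor pauses were logged, the longest one,
# and their sum, all in milliseconds.
summarize_pauses() {
  sed -n '/util\.JvmPauseMonitor/s/.*pause of approximately \([0-9][0-9]*\)ms.*/\1/p' "$1" |
    awk '{ n++; sum += $1; if ($1 > max) max = $1 }
         END { printf "pauses=%d max_ms=%d total_ms=%d\n", n, max, sum }'
}

# Show the ZooKeeper session timeouts and the final expiry, in log order.
zk_session_events() {
  grep -E 'Client session timed out|has expired|received expired from ZooKeeper' "$1"
}

# Example (the path is a placeholder):
#   summarize_pauses /var/log/hbase/hbase-regionserver.log
#   zk_session_events /var/log/hbase/hbase-regionserver.log
```

If the pauses add up to a large share of the window, the session expiry at the end of the log is probably a consequence of the pauses rather than a separate fault. Note that the monitor reports a pause "in JVM or host machine", so the numbers alone do not say whether GC or a starved host (for example one that is swapping under MapReduce load) is responsible.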


qingyuan18 2014-07-22

Do you have a javacore file or a thread dump file?

vah101 2014-07-21

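GC pauses made the process's ZooKeeper connection time out, which led to "received expired from ZooKeeper", and the region server eventually went down. Edit the configuration file hbase-env.sh: set an appropriate heap size and tune the GC options.

References: http://wiki.apache.org/hadoop/PerformanceTuning http://www.cnblogs.com/cenyuhai/p/3235101.html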
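
As a rough illustration of the hbase-env.sh change suggested in this reply, a fragment might look like the one below. Every value here is a placeholder, not a recommendation for this cluster: the heap size, the occupancy fraction, and the GC log path are all assumptions, and the flags shown are the CMS-era options of the Java 6/7 JVMs in use in 2014. For context, the LruBlockCache line in the posted log reports max=404.50 MB; if the block cache is sized at 40% of the heap (an assumption about this cluster's settings), that corresponds to a heap of roughly 1 GB.

```shell
# hbase-env.sh (fragment). All values are illustrative placeholders.

# Maximum heap for the HBase daemons, in MB. Pick a value that still leaves
# memory for the MapReduce tasks and the OS on the same machine.
export HBASE_HEAPSIZE=4096

# CMS collector settings for the RegionServer.
export HBASE_REGIONSERVER_OPTS="${HBASE_REGIONSERVER_OPTS:-} \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"

# GC logging, so pause lengths can be checked after the change.
# The log path is a placeholder.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hbase/gc-regionserver.log"
```

A larger heap only helps if the machine really has the memory to back it. If HBase and the MapReduce tasks together exceed physical RAM, the host may start swapping, which can stall the process in much the same way as a long GC pause.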