What causes HBase to crash unexpectedly?

the_gunner 2014-07-03 12:38:13
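Our company's cluster runs on fairly low-spec hardware, and the HBase RegionServer or HMaster processes crash frequently, but I don't know the exact cause. My guess is that MapReduce jobs are competing with HBase for system resources: once HBase is up, it runs without problems as long as no MapReduce jobs run at the same time. The crashes usually happen while a MapReduce import job is running.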

Could someone with more experience confirm whether this is the cause?

Here is some of the log output I found from around the time of a crash:


2014-07-03 11:35:00,516 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10979ms
2014-07-03 11:35:16,731 WARN [regionserver60020] util.Sleeper: We slept 16189ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2014-07-03 11:35:16,746 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12933ms
2014-07-03 11:35:29,801 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 6797ms
2014-07-03 11:35:31,768 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1466ms
2014-07-03 11:35:31,768 INFO [regionserver60020-SendThread(u07:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 66866ms for sessionid 0x646f9e22a350000, closing socket connection and attempting reconnect
2014-07-03 11:35:36,592 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 4307ms
2014-07-03 11:35:49,857 WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 20056ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2014-07-03 11:35:49,858 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12078ms
2014-07-03 11:35:52,555 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2029ms
2014-07-03 11:35:58,543 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 5487ms
2014-07-03 11:36:02,560 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2094ms
2014-07-03 11:36:06,415 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3354ms
2014-07-03 11:36:08,135 INFO [regionserver60020-SendThread(u04:2181)] zookeeper.ClientCnxn: Opening socket connection to server u04/192.168.85.131:2181. Will not attempt to authenticate using SASL (无法定位登录配置)
2014-07-03 11:36:08,288 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1220ms
2014-07-03 11:36:11,092 INFO [regionserver60020-SendThread(u04:2181)] zookeeper.ClientCnxn: Socket connection established to u04/192.168.85.131:2181, initiating session
2014-07-03 11:36:12,170 INFO [regionserver60020-SendThread(u04:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x646f9e22a350000, likely server has closed socket, closing socket connection and attempting reconnect
2014-07-03 11:36:15,675 INFO [regionserver60020-SendThread(u02:2181)] zookeeper.ClientCnxn: Opening socket connection to server u02/192.168.85.129:2181. Will not attempt to authenticate using SASL (无法定位登录配置)
2014-07-03 11:36:15,686 INFO [regionserver60020-SendThread(u02:2181)] zookeeper.ClientCnxn: Socket connection established to u02/192.168.85.129:2181, initiating session
2014-07-03 11:36:25,135 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3709ms
2014-07-03 11:36:34,095 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 8459ms
2014-07-03 11:36:34,098 INFO [regionserver60020-SendThread(u02:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 17147ms for sessionid 0x646f9e22a350000, closing socket connection and attempting reconnect
2014-07-03 11:36:34,772 DEBUG [LruStats #0] hfile.LruBlockCache: Total=1.90 MB, free=402.60 MB, max=404.50 MB, blocks=0, accesses=42069, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=0, evicted=0, evictedPerRun=NaN
2014-07-03 11:36:38,213 INFO [regionserver60020-SendThread(u03:2181)] zookeeper.ClientCnxn: Opening socket connection to server u03/192.168.85.130:2181. Will not attempt to authenticate using SASL (无法定位登录配置)
2014-07-03 11:36:38,804 INFO [regionserver60020-SendThread(u03:2181)] zookeeper.ClientCnxn: Socket connection established to u03/192.168.85.130:2181, initiating session
2014-07-03 11:36:51,474 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12168ms
2014-07-03 11:36:53,638 INFO [regionserver60020-SendThread(u03:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x646f9e22a350000, likely server has closed socket, closing socket connection and attempting reconnect
2014-07-03 11:37:08,385 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 4707ms
2014-07-03 11:37:09,333 WARN [regionserver60020] util.Sleeper: We slept 14746ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2014-07-03 11:37:09,963 INFO [regionserver60020-SendThread(u05:2181)] zookeeper.ClientCnxn: Opening socket connection to server u05/192.168.85.132:2181. Will not attempt to authenticate using SASL (无法定位登录配置)
2014-07-03 11:37:11,451 INFO [regionserver60020-SendThread(u05:2181)] zookeeper.ClientCnxn: Socket connection established to u05/192.168.85.132:2181, initiating session
2014-07-03 11:37:16,208 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2610ms
2014-07-03 11:37:20,313 INFO [regionserver60020-SendThread(u05:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x646f9e22a350000 has expired, closing socket connection
2014-07-03 11:37:20,388 FATAL [regionserver60020-EventThread] regionserver.HRegionServer: ABORTING region server u05,60020,1404351691231: regionserver:60020-0x646f9e22a350000-0x646f9e22a350000-0x646f9e22a350000, quorum=u04:2181,u03:2181,u02:2181,u01:2181,u08:2181,u07:2181,u06:2181,u05:2181, baseZNode=/hbase regionserver:60020-0x646f9e22a350000-0x646f9e22a350000-0x646f9e22a350000 received expired from ZooKeeper, aborting
2014-07-03 11:37:29,418 FATAL [regionserver60020-EventThread] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
3 replies
qingyuan18 2014-07-22
Do you have a javacore file or a thread dump file?
vah101 2014-07-21
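GC pauses made the process's ZooKeeper session time out, which produced "received expired from ZooKeeper", and the region server then aborted. Edit the configuration file hbase-env.sh and set an appropriate heap size and GC options.
References: http://wiki.apache.org/hadoop/PerformanceTuning http://www.cnblogs.com/cenyuhai/p/3235101.html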
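To check the GC diagnosis against a log like the one in the question, you can tally the reported pauses with a short script. Below is a minimal sketch in Python. It assumes the log lines look like the ones posted above; the function name `summarize_pauses` and the 10-second `long_pause_ms` threshold are my own illustrative choices, not HBase settings.

```python
import re

# Matches the duration in JvmPauseMonitor lines such as
# "... pause of approximately 10979ms"
PAUSE_RE = re.compile(r"JvmPauseMonitor.*pause of approximately (\d+)ms")
# Matches ZooKeeper client timeouts such as
# "... have not heard from server in 66866ms ..."
TIMEOUT_RE = re.compile(
    r"Client session timed out, have not heard from server in (\d+)ms"
)


def summarize_pauses(lines, long_pause_ms=10000):
    """Collect pause and session-timeout durations from region server log lines.

    long_pause_ms is an arbitrary illustrative threshold, not an HBase default.
    """
    pauses = []
    timeouts = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            pauses.append(int(match.group(1)))
            continue
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts.append(int(match.group(1)))
    return {
        "pause_count": len(pauses),
        "total_pause_ms": sum(pauses),
        "max_pause_ms": max(pauses, default=0),
        "long_pauses": [p for p in pauses if p >= long_pause_ms],
        "session_timeouts_ms": timeouts,
    }


if __name__ == "__main__":
    # Three lines copied from the log in the question.
    sample = [
        "2014-07-03 11:35:00,516 WARN [JvmPauseMonitor] util.JvmPauseMonitor: "
        "Detected pause in JVM or host machine (eg GC): pause of approximately 10979ms",
        "2014-07-03 11:35:29,801 INFO [JvmPauseMonitor] util.JvmPauseMonitor: "
        "Detected pause in JVM or host machine (eg GC): pause of approximately 6797ms",
        "2014-07-03 11:35:31,768 INFO [regionserver60020-SendThread(u07:2181)] "
        "zookeeper.ClientCnxn: Client session timed out, have not heard from "
        "server in 66866ms for sessionid 0x646f9e22a350000, closing socket "
        "connection and attempting reconnect",
    ]
    print(summarize_pauses(sample))
```

If long pauses cluster just before each "Client session timed out" line, as they appear to in the posted log, that fits the GC explanation. The script only counts what JvmPauseMonitor reports and cannot tell a GC pause from a stall of the host machine, so a GC log is still needed to confirm it.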
weitao1234 2014-07-19
The RegionServer or HMaster goes down while the garbage collector is running a GC. You can try increasing the JVM's memory.