[HACMP Cluster] /tmp filling up caused HA to shut down automatically on both nodes. Help needed!

weixin_38061618 2012-11-23 11:20:58

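Event description:
At 12:17 on the 21st, DB01 raised a /tmp-full alarm; at the same time, HA on DB01 shut down automatically and failed over to DB02.
DB02 had also raised a /tmp-full alarm at 12:03, but usage quickly dropped back to 80% on its own. /tmp is 2 GB on both nodes. At 12:18, DB02 took over HA normally,
but at 13:05 HA on DB02 also shut down suddenly. At 13:05, /tmp on DB02 was at 80%, not full, so I cannot work out why HA shut down there as well.
The logs are as follows: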

On DB01:

#errpt -a

---
LABEL:          SRC_SVKO
IDENTIFIER:     BC3BE5A3

Date/Time:       Wed Nov 21 12:18:49 BEIST 2012
Sequence Number: 1544
Machine Id:      00C3EEA44C00
Node Id:         DB01
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   SRC            

Description
SOFTWARE PROGRAM ERROR

Probable Causes
APPLICATION PROGRAM

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
         MANUALLY RESTART SUBSYSTEM IF NEEDED

Detail Data
SYMPTOM CODE
          256
SOFTWARE ERROR CODE
        -9017
ERROR CODE
            0
DETECTING MODULE
'srchevn.c'@line:'376'
FAILING MODULE
clstrmgrES
---------------------------------------------------------------------------
LABEL:          J2_FS_FULL
IDENTIFIER:     F7FA22C9

Date/Time:       Wed Nov 21 12:17:45 BEIST 2012
Sequence Number: 1543
Machine Id:      00C3EEA44C00
Node Id:         DB01
Class:           O
Type:            INFO
WPAR:            Global
Resource Name:   SYSJ2           

Description
UNABLE TO ALLOCATE SPACE IN FILE SYSTEM

Probable Causes
FILE SYSTEM FULL

        Recommended Actions
         INCREASE THE SIZE OF THE ASSOCIATED FILE SYSTEM
         REMOVE UNNECESSARY DATA FROM FILE SYSTEM
         USE FUSER UTILITY TO LOCATE UNLINKED FILES STILL REFERENCED

Detail Data
JFS2 MAJOR/MINOR DEVICE NUMBER
000A 0007
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/hd3, /tmp

#more cluster.log

Nov 21 12:17:59 DB01 local0:crit clstrmgrES[78500]: Wed Nov 21 12:17:59 HACMP: clstrmgrES: SrcStopForce: Called
Nov 21 12:18:02 DB01 user:notice HACMP for AIX: EVENT START: node_down DB01
Nov 21 12:18:02 DB01 user:notice HACMP for AIX: EVENT START: stop_server ora_monitor
Nov 21 12:18:36 DB01 user:notice HACMP for AIX: EVENT COMPLETED: stop_server ora_monitor 0
Nov 21 12:18:38 DB01 user:notice HACMP for AIX: EVENT START: release_service_addr
Nov 21 12:18:39 DB01 user:notice HACMP for AIX: EVENT COMPLETED: release_service_addr 0
Nov 21 12:18:39 DB01 user:notice HACMP for AIX: EVENT COMPLETED: node_down DB01 0
Nov 21 12:18:46 DB01 user:notice HACMP for AIX: EVENT START: node_down_complete DB01
Nov 21 12:18:46 DB01 user:notice HACMP for AIX: EVENT COMPLETED: node_down_complete DB01 0
Nov 21 12:18:49 DB01 local0:crit clstrmgrES[78500]: Wed Nov 21 12:18:49 HACMP: clstrmgrES: approvalCb: Quit flag was set, exiting
Nov 21 12:18:56 DB01 daemon:notice topsvcs[127554]: (Recorded using libct_ffdc.a cv 2):::Error ID: 6SQG4h/kM3fE/7bW186pl8....................:::Reference ID: 6UpNEL0AXDlD/WzJ.86pl8....................:::Template ID: 6d19271e::etails File:  :cation: rsct,comm.C,1.148,634                         :::TS_STOP_ST Topology Services daemon stopped Topology Services daemon stopped by: Signal SIGTERM


On DB02:


#errpt -a

LABEL:          J2_FS_FULL
IDENTIFIER:     F7FA22C9

Date/Time:       Wed Nov 21 12:03:52 BEIST 2012
Sequence Number: 1687
Machine Id:      00C3EED44C00
Node Id:         DB02
Class:           O
Type:            INFO
WPAR:            Global
Resource Name:   SYSJ2           

Description
UNABLE TO ALLOCATE SPACE IN FILE SYSTEM

Probable Causes
FILE SYSTEM FULL

        Recommended Actions
         INCREASE THE SIZE OF THE ASSOCIATED FILE SYSTEM
         REMOVE UNNECESSARY DATA FROM FILE SYSTEM
         USE FUSER UTILITY TO LOCATE UNLINKED FILES STILL REFERENCED

Detail Data
JFS2 MAJOR/MINOR DEVICE NUMBER
000A 0007
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/hd3, /tmp

---------------------------------------------------------------------------
LABEL:          TS_LOC_DOWN_ST
IDENTIFIER:     173C787F

Date/Time:       Wed Nov 21 12:19:23 BEIST 2012
Sequence Number: 1690
Machine Id:      00C3EED44C00
Node Id:         DB02
Class:           S
Type:            INFO
WPAR:            Global
Resource Name:   topsvcs         

Description
Possible malfunction on local adapter

Probable Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured

Failure Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured

        Recommended Actions
         Verify adapter configuration
         Verify network connectivity

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.22,4329            
ERROR ID
6zV5DL.9N3fE/5NN096pl8....................
REFERENCE CODE
                                          
Adapter interface name
tty0
Adapter offset
            2
Adapter IP address
255.255.0.1

---------------------------------------------------------------------------
LABEL:          SRC_SVKO
IDENTIFIER:     BC3BE5A3

Date/Time:       Wed Nov 21 13:05:21 BEIST 2012
Sequence Number: 1691
Machine Id:      00C3EED44C00
Node Id:         DB02
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   SRC            

Description
SOFTWARE PROGRAM ERROR

Probable Causes
APPLICATION PROGRAM

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
         MANUALLY RESTART SUBSYSTEM IF NEEDED

Detail Data
SYMPTOM CODE
          256
SOFTWARE ERROR CODE
        -9017
ERROR CODE
            0
DETECTING MODULE
'srchevn.c'@line:'376'
FAILING MODULE
clstrmgrES
---------------------------------------------------------------------------

#more cluster.log

Nov 21 12:18:42 DB02 user:notice HACMP for AIX: EVENT START: node_down DB01
Nov 21 12:18:42 DB02 user:notice HACMP for AIX: EVENT START: acquire_takeover_addr
Nov 21 12:18:43 DB02 user:notice HACMP for AIX: EVENT COMPLETED: acquire_takeover_addr 0
Nov 21 12:18:46 DB02 user:notice HACMP for AIX: EVENT COMPLETED: node_down DB01 0
Nov 21 12:18:46 DB02 user:notice HACMP for AIX: EVENT START: node_down_complete DB01
Nov 21 12:18:46 DB02 user:notice HACMP for AIX: EVENT START: start_server ora_monitor
Nov 21 12:18:46 DB02 user:notice HACMP for AIX: EVENT COMPLETED: start_server ora_monitor 0
Nov 21 12:18:46 DB02 user:notice HACMP for AIX: EVENT COMPLETED: node_down_complete DB01 0
Nov 21 12:18:49 DB02 local0:crit clstrmgrES[74200]: Wed Nov 21 12:18:49 Removing 3 from ml_idx
Nov 21 12:19:23 DB02 daemon:notice topsvcs[164044]: (Recorded using libct_ffdc.a cv 2):::Error ID: 6zV5DL.9N3fE/5NN096pl8....................:::Reference ID: :::Template ID: 173c787f::etails File:  :cation: rsct,nim_control.C,1.39.1.22,4329             :::TS_LOC_DOWN_ST Possible malfunction on local adapter Adapter interface name tty0 Adapter offset 2 Adapter IP address 255.255.0.1
Nov 21 12:19:25 DB02 user:notice HACMP for AIX: EVENT START: network_down minus 1 net_rs232_01
Nov 21 12:19:25 DB02 user:notice HACMP for AIX: EVENT COMPLETED: network_down minus 1 net_rs232_01 0
Nov 21 12:19:26 DB02 user:notice HACMP for AIX: EVENT START: network_down_complete minus 1 net_rs232_01
Nov 21 12:19:26 DB02 user:notice HACMP for AIX: EVENT COMPLETED: network_down_complete minus 1 net_rs232_01 0
Nov 21 13:04:59 DB02 local0:crit clstrmgrES[74200]: Wed Nov 21 13:04:59 HACMP: clstrmgrES: SrcStopForce: Called
Nov 21 13:04:59 DB02 user:notice HACMP for AIX: EVENT START: node_down DB02
Nov 21 13:04:59 DB02 user:notice HACMP for AIX: EVENT START: stop_server ora_monitor
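
To compare the two nodes, it can help to merge both cluster.log excerpts into one timeline and list the events that started but never logged a completion. The following is only a rough sketch, not an HACMP tool: the function names `parse_events` and `unfinished` are made up for illustration, the regular expression is based solely on the line format shown in the excerpts above, and the year is passed in as an assumption because the syslog prefix does not carry one.

```python
import re
from datetime import datetime

# Matches lines such as:
# "Nov 21 12:18:02 DB01 user:notice HACMP for AIX: EVENT START: node_down DB01"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<node>\S+)\s+\S+\s+"
    r"HACMP for AIX: EVENT (?P<state>START|COMPLETED): (?P<event>.+)$"
)


def parse_events(lines, year=2012):
    """Return (timestamp, node, state, event) tuples ordered by timestamp.

    Lines that are not HACMP event lines (clstrmgrES, topsvcs, ...) are skipped.
    The year is an assumption supplied by the caller.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        events.append((ts, m["node"], m["state"], m["event"]))
    # Stable sort on the timestamp only, so same-second lines keep file order.
    return sorted(events, key=lambda e: e[0])


def unfinished(events):
    """Return (node, event) pairs that have a START but no matching COMPLETED."""
    open_events = []
    for _ts, node, state, event in events:
        if state == "START":
            open_events.append((node, event))
        else:
            # COMPLETED lines end with a return code, e.g. "stop_server ora_monitor 0".
            name = event.rsplit(" ", 1)[0]
            if (node, name) in open_events:
                open_events.remove((node, name))
    return open_events
```

Fed with the lines from both excerpts, `unfinished` would report `node_down DB02` and `stop_server ora_monitor` on DB02, since the excerpt ends before either event completes. That only reflects where the pasted log stops; it does not by itself say why clstrmgrES was stopped.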



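I also suspected that HA on DB02 shut down because /tmp did not have enough space while clstrmgr.debug was being written, but my clstrmgr.debug is not under /tmp; it is under /var/hacmp. The IZ05428 fix has also been applied. The URL below is a case I found, but it does not quite match mine: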
http://www-01.ibm.com/support/docview.wss?uid=isg1IZ05428
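
Since /tmp on DB02 filled at 12:03 and then quickly dropped back to 80%, one thing that could be checked is whether a similar short spike happened around 13:05 and was missed between monitoring samples. This is only a possibility, not a diagnosis. Below is a minimal sketch of a sampler, assuming Python is available on the nodes; `usage_percent`, `sample_usage`, the interval and the threshold are all hypothetical choices, and the percentage from `shutil.disk_usage` may not match the `%Used` column of `df` exactly.

```python
import shutil
import time


def usage_percent(path="/tmp"):
    """Return used space as a percentage of the total size of the file system."""
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total


def sample_usage(path="/tmp", interval=5.0, count=3, threshold=90.0):
    """Take `count` samples `interval` seconds apart.

    Returns a list of (time string, percent used, over_threshold) tuples.
    """
    samples = []
    for i in range(count):
        pct = usage_percent(path)
        samples.append((time.strftime("%H:%M:%S"), pct, pct >= threshold))
        if i < count - 1:
            time.sleep(interval)
    return samples
```

Running `sample_usage("/tmp", interval=1.0, count=60)` in a loop and appending the result to a file outside /tmp would give a per-second record that can be compared against the errpt and cluster.log timestamps.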