Oracle RAC: one instance went down

huangzhimeng 2017-10-25 04:26:03

Environment: Oracle 11.2.0.4 RAC on Red Hat 6.6
On September 13 we found that one instance had gone down.
Alert log:
Errors in file /oracle/app/oracle/diag/rdbms/bssoradb/bssoradb2/trace/bssoradb2_lmon_13671.trc (incident=432089):
ORA-29740: evicted by instance number 1, group incarnation 46
Incident details in: /oracle/app/oracle/diag/rdbms/bssoradb/bssoradb2/incident/incdir_432089/bssoradb2_lmon_13671_i432089.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /oracle/app/oracle/diag/rdbms/bssoradb/bssoradb2/trace/bssoradb2_lmon_13671.trc:
ORA-29740: evicted by instance number 1, group incarnation 46
LMON (ospid: 13671): terminating the instance due to error 29740
Wed Sep 13 18:39:49 2017
System state dump requested by (instance=2, osid=13671 (LMON)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/app/oracle/diag/rdbms/bssoradb/bssoradb2/trace/bssoradb2_diag_13661_20170913183949.trc
Instance terminated by LMON, pid = 13671

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The corresponding trace file, /oracle/app/oracle/diag/rdbms/bssoradb/bssoradb2/trace/bssoradb2_lmon_13671.trc:
*** 2017-09-13 18:39:44.093
kjxgrrcfgchk: Initiating reconfig, reason=3
kjxgrrcfgchk: COMM rcfg - Disk Vote Required
kjfmReceiverHealthCB_CheckAll: Recievers are healthy.
2017-09-13 18:39:44.093242 : kjxgrnetchk: start 0x43f2b35f, end 0x43f369a9
2017-09-13 18:39:44.093265 : kjxgrnetchk: Network Validation wait: 46 sec
2017-09-13 18:39:44.093282 : kjxgrnetchk: Sending comm check req to inst 1
kjxgrrcfgchk: prev pstate 6 mapsz 512
kjxgrrcfgchk: new bmp: 1 2
kjxgrrcfgchk: work bmp: 1 2
kjxgrrcfgchk: rr bmp: 1 2

*** 2017-09-13 18:39:44.093
kjxgmrcfg: Reconfiguration started, type 3
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 44 0.
kjxgrs0h: disable CGS timeout

2017-09-13 18:39:44.537303 : kjxgrcomerr: Suppressed nested communications reconfig: instance 1 (44,44)
kjxgrnetval: all instances have acknowledged
kjxgrrcfgchk: NETVAL: reconfig bitmap chksum 0x317eab92 cnt 2 master 2
SelectVoteMethod: member information
Inst 1, st 0x0017, es 0x0002, cap 0x0
Inst 2, st 0x0107, es 0x0000, cap 0x3
SelectVoteMethod: num mounted 1, unmounted 0
SelectVoteMethod: mounted capatility: nonblocking blocking
SelectVoteMethod: num unmounted nb 0 b 0
SelectVoteMethod: total insts nb 1 b 1
SelectVoteMethod: final capability nonblocking
kjxgrpropmsg: SSVOTE: Master indicates Disk Voting required
2017-09-13 18:39:44.589141 : kjxgrDiskVote: nonblocking method is chosen
2017-09-13 18:39:44.640605 : kjxgrDiskVote: start the disk vote w/ seqno 45
2017-09-13 18:39:44.640658 : kjxgrDiskVote: timeout in 56 sec
Last valid bitmap: 1 2
2017-09-13 18:39:44.640725 : kjxgrDiskVote: active members status:
Inst 1, st 0x0017, es 0x0002, cap 0x0
Inst 2, st 0x0107, es 0x0000, cap 0x3
2017-09-13 18:39:44.692219 : kjxgrDiskVote: voted w/ seq 45 and map: 2
LR trace: *** @ 1-LR1: 2017-09-13 18:39:44.744 01
- rcnt-idx 2
LR trace: *** @ 2-LR2: 2017-09-13 18:39:44.744 00
- rcnt-idx 2

*** 2017-09-13 18:39:44.745
kjxgrf_rr_lock: done - ret = 1 hist 0x12e
2017-09-13 18:39:44.745313 : kjxgrDiskVote: RR lock-get failed w/ status 1
2017-09-13 18:39:44.745350 : kjxgrDiskVote: RR update instance is 1
2017-09-13 18:39:44.849246 : kjxgrDiskVote: detected an inconsistent membership by inst 1 at seq 46
2017-09-13 18:39:44.900729 : kjxgrDiskVote: wait 0 sec for membership resolution
2017-09-13 18:39:44.900806 : kjxgrDiskVote: new membership is from inst 1
2017-09-13 18:39:44.900828 : kjxgrDiskVote: bitmap: 1
2017-09-13 18:39:44.952331 : kjxgrdtrt: Evicted by inst 1, seq (46, 46)
IMR state information
Inst 2, thread 2, state 0x4:124c, flags 0x12ca9:0x0001
RR seq commit 44 cur 46
Propstate 4 prv 3 pending 0
rcfg rsn 3, rcfg time 1139979104, mem ct 2
master 2, master rcfg time 1139979104
evicted memcnt 0, starttm 0 chkcnt 0
system load 0 (normal)
nonblocking disk voting method

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[root@bssdb2 ~]# ps -ef|grep grid
root 7804 7628 0 15:43 pts/0 00:00:00 grep grid
root 19240 1 5 Sep20 ? 1-23:44:07 /oracle/app/grid/product/11.2.0/bin/osysmond.bin
grid 19285 1 0 Sep20 ? 03:04:18 /oracle/app/grid/product/11.2.0/bin/ocssd.bin
root 22794 1 0 Oct24 ? 00:06:35 /oracle/app/grid/product/11.2.0/bin/ohasd.bin reboot
root 22887 1 0 Oct24 ? 00:05:20 /oracle/app/grid/product/11.2.0/bin/orarootagent.bin
root 22890 1 0 Oct24 ? 00:01:22 /oracle/app/grid/product/11.2.0/bin/cssdagent
root 22892 1 0 Oct24 ? 00:01:22 /oracle/app/grid/product/11.2.0/bin/cssdmonitor
grid 23000 1 0 Oct24 ? 00:01:10 /oracle/app/grid/product/11.2.0/bin/oraagent.bin
grid 23011 1 0 Oct24 ? 00:00:14 /oracle/app/grid/product/11.2.0/bin/mdnsd.bin
grid 23035 1 0 Oct24 ? 00:01:05 /oracle/app/grid/product/11.2.0/bin/gpnpd.bin
grid 23051 1 0 Oct24 ? 00:05:13 /oracle/app/grid/product/11.2.0/bin/gipcd.bin
root 23140 1 0 Oct24 ? 00:04:34 /oracle/app/grid/product/11.2.0/bin/octssd.bin reboot
grid 23195 1 0 Oct24 ? 00:03:16 /oracle/app/grid/product/11.2.0/bin/evmd.bin

---------------------------------------
ip addr
link/ether 40:f2:e9:de:53:ac brd ff:ff:ff:ff:ff:ff
inet 192.168.111.104/24 brd 192.168.113.255 scope global bond1 # Oracle heartbeat (interconnect) IP
inet 169.254.49.109/16 brd 169.254.255.255 scope global bond1:1 # Does this IP affect Oracle RAC?
inet6 fe80::42f2:e9ff:fede:53ac/64 scope link
valid_lft forever preferred_lft forever

The database is down now and I can't bring it back up. Any help from the experts here would be appreciated.
ASMCMD> lsdg
ASMCMD-8102: no connection to Oracle ASM; command requires Oracle ASM to run

[oracle@bssdb2 ~]$ crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.

[root@bssdb2 ~]# /oracle/app/grid/product/11.2.0/bin/crsctl stop crs
CRS-2796: The command may not proceed when Cluster Ready Services is not running
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.

[root@bssdb2 ~]# /oracle/app/grid/product/11.2.0/bin/crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

[grid@bssdb2 ~]$ crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
1 ONLINE OFFLINE
ora.cluster_interconnect.haip
1 ONLINE ONLINE bssdb2
ora.crf
1 ONLINE ONLINE bssdb2
ora.crsd
1 ONLINE OFFLINE
ora.cssd
1 ONLINE ONLINE bssdb2
ora.cssdmonitor
1 ONLINE ONLINE bssdb2
ora.ctssd
1 ONLINE ONLINE bssdb2 OBSERVER
ora.diskmon
1 OFFLINE OFFLINE
ora.drivers.acfs
1 ONLINE ONLINE bssdb2
ora.evmd
1 ONLINE INTERMEDIATE bssdb2
ora.gipcd
1 ONLINE ONLINE bssdb2
ora.gpnpd
1 ONLINE ONLINE bssdb2
ora.mdnsd
1 ONLINE ONLINE bssdb2
[grid@bssdb2 ~]$ ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 3
Total space (kbytes) : 262120
Used space (kbytes) : 3324
Available space (kbytes) : 258796
ID : 1806684292
Device/File Name : +OCR
Device/File integrity check succeeded

Device/File not configured

Device/File not configured

Device/File not configured

Device/File not configured

Cluster registry integrity check succeeded

Logical corruption check bypassed due to non-privileged user
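
For anyone triaging a similar listing, here is a minimal sketch (not part of the original post) that filters `crsctl stat res -t -init` output down to the resources whose TARGET is ONLINE but whose STATE is not. The function name `crs_not_online` is made up for illustration, and it assumes the layout shown above: a resource name line followed by an instance line. Run against the listing above, it would flag ora.asm, ora.crsd and ora.evmd.

```shell
# Reads `crsctl stat res -t -init` output on stdin and prints
# "<resource> <state>" for each resource that should be ONLINE but is not.
# Hypothetical helper; it only parses text and changes nothing.
crs_not_online() {
    awk '
        # A resource name line, e.g. "ora.asm"
        $1 ~ /^ora\./ { name = $1; next }
        # An instance line, e.g. "1 ONLINE OFFLINE": TARGET is $2, STATE is $3
        $1 ~ /^[0-9]+$/ && $2 == "ONLINE" && $3 != "ONLINE" { print name, $3 }
    '
}

# Usage (as the grid user):
#   crsctl stat res -t -init | crs_not_online
```

Resources with TARGET OFFLINE (ora.diskmon here) are skipped on purpose, since they are not expected to be running.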

19 replies (newest first)

minsic78 2017-10-27

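Quoting reply #18 by huangzhimeng:

Why would the route go missing? Nobody touched it. Could this be caused by an Oracle internal problem?

I'm curious about that too, but it doesn't look like it to me. At the time I suspected that someone had removed it while doing security hardening or something similar. I didn't dig any deeper, and it never recurred after the fix.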

huangzhimeng 2017-10-27

Why would the route go missing? Nobody touched it. Could this be caused by an Oracle internal problem?

碧水幽幽泉 2017-10-26

Try logging in: sqlplus username/password as sysasm

huangzhimeng 2017-10-26

ora.crsd 1 ONLINE OFFLINE ora.cssd

[grid@bssdb2 ~]$ sqlplus / as sysasm
SQL*Plus: Release 11.2.0.4.0 Production on Thu Oct 26 10:39:53 2017
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup;
ORA-03113: end-of-file on communication channel

[grid@bssdb2 app]$ more ./grid/product/11.2.0/log/diag/asm/+asm/+ASM2/trace/alert_+ASM2.log
Sat Apr 23 15:35:06 2016
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0...
* node 0 does not exist. instance_number = 2
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 32
Number of processor cores in the system is 16
Number of processor sockets in the system is 2
Private Interface 'bond1:1' configured from GPnP for use as a private interconnect.
[name='bond1:1', type=1, ip=169.254.49.109, mac=40-f2-e9-de-53-ac, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'bond0' configured from GPnP for use as a public interface.
[name='bond0', type=1, ip=192.168.X.XX, mac=40-f2-e9-de-53-aa, net=192.168.XX.0/24, mask=255.255.255.0, use=public/1]
CELL communication is configured to use 0 interface(s):
CELL IP affinity details:
NUMA status: NUMA system w/ 2 process groups
cellaffinity.ora status: cannot find affinity map at '/etc/oracle/cell/network-config/cellaffinity.ora' (see trace file for details)
CELL communication will use 1 IP group(s):
Grp 0:
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /oracle/app/grid/product/11.2.0/dbs/arch
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
NOTE: Volume support enabled
NUMA system with 2 nodes detected
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
ORACLE_HOME = /oracle/app/grid/product/11.2.0
System name: Linux
Node name: bssdb2
Release: 2.6.32-279.el6.x86_64
Version: #1 SMP Wed Jun 13 18:24:36 EDT 2012
Machine: x86_64
WARNING: using default parameter settings without any parameter file
Cluster communication is configured to use the following interface(s) for this instance
169.254.49.109
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Sat Apr 23 15:35:07 2016
PMON started with pid=2, OS id=4322
Sat Apr 23 15:35:07 2016
PSP0 started with pid=3, OS id=4324
Sat Apr 23 15:35:08 2016
VKTM started with pid=4, OS id=4326 at elevated priority
VKTM running at (1)millisec precision with DBRM quantum (100)ms
Sat Apr 23 15:35:08 2016
GEN0 started with pid=5, OS id=4330
Sat Apr 23 15:35:08 2016
DIAG started with pid=6, OS id=4332
Sat Apr 23 15:35:08 2016
PING started with pid=7, OS id=4334
Sat Apr 23 15:35:08 2016
DIA0 started with pid=8, OS id=4336
Sat Apr 23 15:35:08 2016
LMON started with pid=9, OS id=4338
Sat Apr 23 15:35:11 2016
LMD0 started with pid=10, OS id=4340
* Load Monitor used for high load check
* New Low - High Load Threshold Range = [30720 - 40960]
Sat Apr 23 15:35:11 2016
LMS0 started with pid=6, OS id=4342 at elevated priority
Sat Apr 23 15:35:11 2016
LMHB started with pid=8, OS id=4346
Sat Apr 23 15:35:11 2016
MMAN started with pid=11, OS id=4348
Sat Apr 23 15:35:11 2016
DBW0 started with pid=12, OS id=4350
Sat Apr 23 15:35:11 2016
LGWR started with pid=13, OS id=4352
Sat Apr 23 15:35:11 2016
CKPT started with pid=14, OS id=4354
Sat Apr 23 15:35:11 2016
SMON started with pid=15, OS id=4356
Sat Apr 23 15:35:11 2016
RBAL started with pid=16, OS id=4358
Sat Apr 23 15:35:11 2016
GMON started with pid=17, OS id=4360
Sat Apr 23 15:35:11 2016
MMON started with pid=18, OS id=4362
Sat Apr 23 15:35:11 2016
MMNL started with pid=19, OS id=4364
Restarting dead background process DIAG
LMON (ospid: 4338): terminating the instance due to error 481

minsic78 2017-10-26

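Quoting reply #7 by minsic78:

Reply #6 (huangzhimeng): The problem started on September 13, so the log excerpt starts from that time.

Reply #7 (minsic78): OP, you didn't answer the question. In the log in your reply #4 above, the ASM instance ends with an automatic startup, but why is there no log after that? From what we can see now, it restarted but failed to come up automatically because it hit some problem. What problem? If posting logs is too much trouble, just restart the ASM instance manually and see whether it reports a more direct error.

To restart manually:
su - grid
sqlplus '/as sysasm'
startup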

minsic78 2017-10-26

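Quoting reply #6 by huangzhimeng:

The problem started on September 13, so the log excerpt starts from that time.

OP, you didn't answer the question. In the log in your reply #4 above, the ASM instance ends with an automatic startup, but why is there no log after that? From what we can see now, it restarted but failed to come up automatically because it hit some problem. What problem? If posting logs is too much trouble, just restart the ASM instance manually and see whether it reports a more direct error.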

huangzhimeng 2017-10-26

The problem started on September 13, so the log excerpt starts from that time.

minsic78 2017-10-26

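Quoting reply #16 by huangzhimeng (with the earlier replies it quoted):

Reply #12 (minsic78): This reminds me of a case I handled earlier this year that was caused by a lost HAIP route, with similar symptoms. If you don't want to spend effort digging, you can rule this cause out first: run netstat -rn on both nodes and see whether either one is missing the 169.254.x.x route.

Reply #14 (huangzhimeng):
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 bond1
The route is there. Oracle has a tool that can collect all the logs for that time window, and I've opened an SR!

Reply #15 (minsic78): Is it there on both nodes? If this is the cause, the node missing the route should be the one that is up, that is, the surviving node. Opening an SR is good too. OP, remember to come back and post Oracle's reply here.

Reply #16 (huangzhimeng): The server that is working normally has no route, and the failed one has it. Does that mean the heartbeat has a problem? Is something wrong with my voting disk?

Bingo! Looks like that's it. This is exactly the problem: the lost route is what keeps the other node from starting. Add the route on the surviving node, and the node that is currently down should be able to start.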
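
The thread does not show the exact command that was used to add the route. One common way on Linux is `ip route add`; the sketch below only prints the command so it can be reviewed before anyone runs it as root. The function name `haip_route_fix_cmd` is hypothetical, and the interface name on the surviving node is an assumption: bond1 carries the 169.254.x.x address on bssdb2 according to the original post, but check the other node with ip addr first.

```shell
# Prints (does not execute) a command that would re-add the link-local
# 169.254.0.0/16 route on the given interface.
# $1: private interconnect interface on the surviving node (assumption: verify it).
haip_route_fix_cmd() {
    iface="${1:?usage: haip_route_fix_cmd <interface>}"
    echo "ip route add 169.254.0.0/16 dev $iface"
}

# Usage: review the printed command, then run it as root on the surviving node.
#   haip_route_fix_cmd bond1
```

A route added this way does not survive a reboot, so it is worth finding out what removed it in the first place.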

huangzhimeng 2017-10-26

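Quoting reply #15 by minsic78:

Reply #12 (minsic78): This reminds me of a case I handled earlier this year that was caused by a lost HAIP route, with similar symptoms. If you don't want to spend effort digging, you can rule this cause out first: run netstat -rn on both nodes and see whether either one is missing the 169.254.x.x route.

The server that is working normally has no route, and the failed one has it. Does that mean the heartbeat has a problem? Is something wrong with my voting disk?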

minsic78 2017-10-26

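Quoting reply #14 by huangzhimeng:

Reply #12 (minsic78): This reminds me of a case I handled earlier this year that was caused by a lost HAIP route, with similar symptoms. If you don't want to spend effort digging, you can rule this cause out first: run netstat -rn on both nodes and see whether either one is missing the 169.254.x.x route.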

huangzhimeng 2017-10-26

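Quoting reply #12 by minsic78:

This reminds me of a case I handled earlier this year that was caused by a lost HAIP route, with similar symptoms. If you don't want to spend effort digging, you can rule this cause out first: run netstat -rn on both nodes and see whether either one is missing the 169.254.x.x route.

169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 bond1
The route is there. Oracle has a tool that can collect all the logs for that time window, and I've opened an SR!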

minsic78 2017-10-26

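Quoting reply #12 by minsic78:

This reminds me of a case I handled earlier this year that was caused by a lost HAIP route, with similar symptoms. If you don't want to spend effort digging, you can rule this cause out first: run netstat -rn on both nodes and see whether either one is missing the 169.254.x.x route.

This problem is very easy to reproduce but really nasty to track down. For a while it was one of the hands-on questions in our company's DBA hiring test.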

minsic78 2017-10-26

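This reminds me of a case I handled earlier this year that was caused by a lost HAIP route, with similar symptoms. If you don't want to spend effort digging, you can rule this cause out first: run netstat -rn on both nodes and see whether either one is missing the 169.254.x.x route.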
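
A minimal sketch of that check, assuming the `netstat -rn` column layout that appears elsewhere in this thread (destination, gateway, genmask, and so on). The function name `haip_route_status` is hypothetical; it only reads text from stdin and changes nothing on the system.

```shell
# Reads `netstat -rn` output on stdin and prints "present" if a
# 169.254.0.0/255.255.0.0 route is listed, otherwise "missing".
haip_route_status() {
    # $1 is the destination column, $3 the genmask column
    if awk '$1 == "169.254.0.0" && $3 == "255.255.0.0" { found = 1 } END { exit !found }'; then
        echo present
    else
        echo missing
    fi
}

# Usage (run on each node and compare):
#   netstat -rn | haip_route_status
```

As the rest of the thread shows, the node to look at most carefully is the one that is still up, not the one that failed.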

minsic78 2017-10-26

Quoting reply #9 by huangzhimeng (the sqlplus startup attempt and the ASM alert log, ending in "LMON (ospid: 4338): terminating the instance due to error 481"):

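So the ASM instance kills itself as soon as it starts... and this error 481 carries very little information:

Quote:

oerr ora 481
00481, 00000, "LMON process terminated with error"
// *Cause: The global enqueue service monitor process died
// *Action: Warm start instance

Looks like we need to investigate from a different direction. OP, take a look at the GI alert log and see whether there is any suspicious output. It doesn't have to be an error from the moment the ASM instance died or failed to restart; check whether anything abnormal happened before that.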
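
A rough sketch for a first pass over the GI alert log: count how often each CRS message code appears, so that the unusual ones stand out. The function name `crs_code_summary` is hypothetical, and the log path in the usage comment is an assumption based on the usual 11.2 layout ($GRID_HOME/log/<hostname>/alert<hostname>.log); adjust it for your system.

```shell
# Reads log text on stdin and prints "<CRS code> <count>" lines,
# most frequent first (ties ordered by code).
crs_code_summary() {
    grep -oE 'CRS-[0-9]+' \
        | sort \
        | uniq -c \
        | sort -k1,1nr -k2,2 \
        | awk '{ print $2, $1 }'
}

# Usage (path is an assumption, see above):
#   crs_code_summary < /oracle/app/grid/product/11.2.0/log/bssdb2/alertbssdb2.log
```

This is only a way to decide where to start reading; the surrounding timestamps and messages still have to be read in the log itself.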

minsic78 2017-10-25

Quoting reply #4 by huangzhimeng (with the earlier replies it quoted):

Reply #1 (minsic78): What a mess.
1. Looking at the LMON trace, it seems the voting disk could not be reached? Check at the OS level whether there was a storage-related problem at the same point in time, and whether it is still ongoing.
2. Run crsctl stat res -t -init to see whether all the init resources have come up. From your post, the ASM instance doesn't seem to have started either, so it is very likely still a storage problem.

Reply #2 (huangzhimeng): There is a voting disk check at the end of my post. ASM indeed has not started!

Reply #3 (minsic78): Then you should check the ASM log. But if disks going offline are involved and you're in a hurry, just look at the OS logs directly; nine times out of ten it's a storage path problem.

Reply #4 (huangzhimeng): The other instance is still up, so there's no rush.

NOTE: ASM client bssoradb2:bssoradb disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /oracle/app/grid/product/11.2.0/rdbms/log/+asm2_ora_31393.trc
NOTE: client bssoradb2:bssoradb registered, osid 13715, mbr 0x1
Fri Aug 25 19:22:45 2017
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
Wed Sep 13 18:39:48 2017
NOTE: ASM client bssoradb2:bssoradb disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /oracle/app/grid/product/11.2.0/rdbms/log/+asm2_ora_13715.trc
NOTE: client bssoradb2:bssoradb registered, osid 26543, mbr 0x1
Wed Sep 13 18:40:19 2017
IPC Send timeout detected. Sender: ospid 1914 [oracle@bssdb2 (PING)]
Receiver: inst 1 binc 1458570609 ospid 10330
Wed Sep 13 18:41:15 2017
NOTE: ASM client bssoradb2:bssoradb disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /oracle/app/grid/product/11.2.0/rdbms/log/+asm2_ora_26543.trc
Wed Sep 13 18:42:19 2017
Detected an inconsistent instance membership by instance 1
Wed Sep 13 18:42:19 2017
Errors in file /oracle/app/grid/product/11.2.0/rdbms/log/+asm2_lmon_1919.trc:
ORA-29740: evicted by instance number 1, group incarnation 18
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Wed Sep 13 18:42:20 2017
Errors in file /oracle/app/grid/product/11.2.0/rdbms/log/+asm2_lmon_1919.trc:
ORA-29740: evicted by instance number 1, group incarnation 18
LMON (ospid: 1919): terminating the instance due to error 29740
Wed Sep 13 18:42:24 2017
System state dump requested by (instance=2, osid=1919 (LMON)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/app/grid/product/11.2.0/rdbms/log/+asm2_diag_1912.trc
Dumping diagnostic data in directory=[cdmp_20170913184221], requested by (instance=2, osid=1919 (LMON)), summary=[abnormal instance termination].
Wed Sep 13 18:42:24 2017
Instance terminated by LMON, pid = 1919
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0...
* node 0 does not exist. instance_number = 2
Error with dbgriap_init_adr_pga: 48178
ORA-48178: error encountered while reading an ADR block file during ADR initialization [/oracle/app/oracle/diag/asm/+asm/+ASM2/metadata/ADR_INTERNAL.mif]
ORA-48122: error with opening the ADR block file [/oracle/app/oracle/diag/asm/+asm/+ASM2/metadata/ADR_INTERNAL.mif] [0]
ORA-27041: unable to open file
Linux-x86_64 Error: 13: Permission denied
Additional information: 3
Wed Sep 13 18:42:25 2017
ERROR: The process is unable to create the ADR schema in the diagnostic_dest directory
ERROR: because of a disk issue or OS platform issue
Wed Sep 13 18:42:25 2017
ERROR: Reverting back to using the user_dump_dest and background_dump_dest
ERROR: as the location for the traces and logs
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0

Were there other errors before this? Toward the end of this log, it looks like the local file system on your OS is full, or something like that? And why does the log just stop when the instance starts up again...

huangzhimeng 2017-10-25

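Quoting reply #3 by minsic78:

Reply #1 (minsic78): What a mess.
1. Looking at the LMON trace, it seems the voting disk could not be reached? Check at the OS level whether there was a storage-related problem at the same point in time, and whether it is still ongoing.
2. Run crsctl stat res -t -init to see whether all the init resources have come up. From your post, the ASM instance doesn't seem to have started either, so it is very likely still a storage problem.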

minsic78 2017-10-25

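Quoting reply #2 by huangzhimeng:

Reply #1 (minsic78): What a mess.
1. Looking at the LMON trace, it seems the voting disk could not be reached? Check at the OS level whether there was a storage-related problem at the same point in time, and whether it is still ongoing.
2. Run crsctl stat res -t -init to see whether all the init resources have come up. From your post, the ASM instance doesn't seem to have started either, so it is very likely still a storage problem.

Reply #2 (huangzhimeng): There is a voting disk check at the end of my post. ASM indeed has not started!

Then you should check the ASM log. But if disks going offline are involved and you're in a hurry, just look at the OS logs directly; nine times out of ten it's a storage path problem.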

huangzhimeng 2017-10-25