Date: 2014-05-16

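Notes on troubleshooting an Oracle RAC VIP startup error
I have been traveling for work these past few days, and with an exam coming up I have had to keep studying on the road, sadly. My laptop had no RAC environment, so I decided to install Oracle RAC 10g in virtual machines, on Linux AS3 with Oracle 10.2.0.1. The public and private NICs both use host-only mode. I have installed RAC 10g on virtual machines at least five times. Given how unstable virtual machines are, none of those installs went entirely smoothly, but each was basically finished within a day. This one was more frustrating and took much longer, so there is more worth noting (the earlier installs went well enough that I overlooked many details): turning off the firewall, time synchronization between host and guest, carving up the shared storage, the virtual machine parameters, the gateway settings of the virtual machines, package installation, and so on. Four of these deserve particular attention:
1. Package installation. I strongly recommend installing the whole development tools package group (develop tool). If you have plenty of time, you can of course install the packages one by one!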
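2. Virtual machine parameters. To save fellow DBAs some detours, the parameters should be set as follows (note that the VMware version here is VMware Server 2.0):
Quote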
disk.locking = "FALSE"
diskLib.dataCacheMaxSize = "0"
diskLib.dataCacheMaxReadAheadSize = "0"
diskLib.dataCacheMinReadAheadSize = "0"
diskLib.dataCachePageSize = "4096"
diskLib.maxUnsyncedWrites = "0"
scsi1.present = "TRUE"
scsi1.virtualDev = "lsilogic"
scsi1.sharedBus = "VIRTUAL"

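It was precisely because my parameters were incomplete that I ran into many strange problems, such as ASM disk groups that could not be mounted on both nodes at once, or disk headers getting corrupted after one node had mounted them. This cost me almost a whole day, and I would rather not relive it! At first I had set only the following three parameters:
Quote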
disk.locking = "FALSE"
diskLib.dataCacheMaxSize = "0"
scsi1.sharedBus = "VIRTUAL"
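
Comparing a .vmx file against the full parameter list by eye is easy to get wrong, so here is a minimal sketch of an automated check. It assumes the settings appear as plain `key = "value"` lines, as in the quotes above; the names `REQUIRED`, `parse_vmx`, and `check_vmx` are made up for this illustration, and comparing values case-insensitively is my own assumption.

```python
# Shared-disk settings copied from the full list quoted in item 2.
REQUIRED = {
    "disk.locking": "FALSE",
    "diskLib.dataCacheMaxSize": "0",
    "diskLib.dataCacheMaxReadAheadSize": "0",
    "diskLib.dataCacheMinReadAheadSize": "0",
    "diskLib.dataCachePageSize": "4096",
    "diskLib.maxUnsyncedWrites": "0",
    "scsi1.present": "TRUE",
    "scsi1.virtualDev": "lsilogic",
    "scsi1.sharedBus": "VIRTUAL",
}


def parse_vmx(text):
    """Parse `key = "value"` lines from .vmx text into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def check_vmx(text):
    """Return {key: (expected, actual)} for each missing or different setting.

    `actual` is None when the key is absent from the file.
    """
    found = parse_vmx(text)
    problems = {}
    for key, expected in REQUIRED.items():
        actual = found.get(key)
        # Assumption: values are compared case-insensitively.
        if actual is None or actual.lower() != expected.lower():
            problems[key] = (expected, actual)
    return problems


if __name__ == "__main__":
    # The three-parameter configuration I started with.
    partial = (
        'disk.locking = "FALSE"\n'
        'diskLib.dataCacheMaxSize = "0"\n'
        'scsi1.sharedBus = "VIRTUAL"\n'
    )
    for key, (expected, actual) in check_vmx(partial).items():
        print(f"{key}: expected {expected!r}, found {actual!r}")
```

Run against my original three-parameter configuration, it reports every setting from the full list that I had left out.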

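3. Preallocate the shared virtual disks at their full size. This comes from experience. I cannot point to any theory behind it, but in practice preallocated disks are far less likely to develop bad blocks, and many baffling problems turn out to be caused by bad blocks. The trade-off is that a preallocated disk takes up its full space on the host from the start.
4. Do not configure a gateway on the virtual machines. This problem bothered me for a long time and also cost nearly a day. I had configured a gateway on the same subnet as the public NIC, and once CRS was installed, strange things started to happen:
a. The VIPs on the nodes often went offline for no apparent reason.
b. VIP addresses often ended up on the wrong node: node 1 would bring up node 2's VIP but could not bring up its own.
c. The VIP could not be brought up through nodeapps.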
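These three symptoms puzzled me for a long time. Along the way I tried reinstalling and upgrading CRS, switching the NICs from host-only to bridged mode, and changing the gateway, but the fault persisted. The crsd.log error log showed only:
Quote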
2011-06-24 13:33:51.682: [  CRSRES][570047408]0Attempting to start `ora.racsvr1.vip` on member `racsvr1`
2011-06-24 13:34:04.374: [  CRSAPP][570047408]0StartResource error for ora.racsvr1.vip error code = 1
2011-06-24 13:34:07.884: [  CRSRES][570047408]0Start of `ora.racsvr1.vip` on member `racsvr1` failed.

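I had tried everything I could think of and the problem was still unsolved, which was frustrating. Then it occurred to me to start the VIP on its own, and this time the error was different:

Quote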
[oracle@racsvr1 oracle]$ crs_start ora.racsvr1.vip
Attempting to start `ora.racsvr1.vip` on member `racsvr1`
Start of `ora.racsvr1.vip` on member `racsvr1` failed.
CRS-1006: No more members to consider

CRS-0215: Could not start resource 'ora.racsvr1.vip'.


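I checked Metalink right away. Following Metalink note CRS-0215: Could not start resource 'ora..vip' [ID 356535.1],
I edited $ORA_CRS_HOME/bin/racgvip and set FAIL_WHEN_DEFAULTGW_NOT_FOUND=0. With this setting, the VIP start no longer reports an error when its check finds no default gateway.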
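
For anyone who would rather script the change than edit it by hand, here is a rough sketch. It assumes the flag appears in racgvip as an ordinary shell assignment at the start of a line, which you should verify in your own copy. The helper names `set_default_gw_flag` and `patch_script` and the `.bak` backup suffix are my own inventions.

```python
import re
import shutil
from pathlib import Path

FLAG = "FAIL_WHEN_DEFAULTGW_NOT_FOUND"
# Assumption: the flag is set with a plain `NAME=value` line, optionally indented.
_ASSIGN = re.compile(rf"^([ \t]*{FLAG}=)\S*", re.MULTILINE)


def set_default_gw_flag(script_text, value=0):
    """Return (new_text, count) with every assignment of the flag set to `value`."""
    return _ASSIGN.subn(lambda m: f"{m.group(1)}{value}", script_text)


def patch_script(path, value=0):
    """Rewrite the flag in the file at `path`, keeping a `.bak` copy of the original.

    Returns the number of assignments found. The file is left untouched when
    nothing would change.
    """
    path = Path(path)
    original = path.read_text()
    patched, count = set_default_gw_flag(original, value)
    if count and patched != original:
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        path.write_text(patched)
    return count


if __name__ == "__main__":
    # Hypothetical excerpt for demonstration only, not the real file contents.
    sample = "#!/bin/sh\nFAIL_WHEN_DEFAULTGW_NOT_FOUND=1\n"
    new_text, count = set_default_gw_flag(sample)
    print(count)
    print(new_text)
```

On a real node you would call `patch_script` with the path to racgvip under your CRS home, as the software owner. A return value of 0 means the assumed line format did not match and nothing was changed.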
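With the parameter set, I tried again, this time starting the VIP through nodeapps, and it failed again. The error closely resembles Metalink Bug 5076555: VIP CRASHING FAIL_WHEN_DEFAULTGW_NOT_FOUND=0 SEEMS TO BE IGNORED, but that entry offers no solution. Oracle deserves some criticism here: having classified this as a bug, it has done nothing about it and does not even offer a workaround.
Quote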
[oracle@racsvr1 oracle]$ srvctl start nodeapps -n racsvr1
racsvr1:ora.racsvr1.vip:ping to 10.20.30.99 via eth0 failed, rc = 1 (host=racsvr1)
racsvr1:ora.racsvr1.vip:ping to 10.20.30.99 via eth0 failed, rc = 1 (host=racsvr1)
racsvr1:ora.racsvr1.vip:Interface eth0 checked failed (host=racsvr1)
racsvr1:ora.racsvr1.vip:Invalid parameters, or failed to bring up VIP (host=racsvr1)
CRS-1006: No more members to consider
CRS-0215: Could not start resource 'ora.racsvr1.vip'.
racsvr1:ora.racsvr1.vip:ping to 10.20.30.99 via eth0 failed, rc = 1 (host=racsvr1)
racsvr1:ora.racsvr1.vip:ping to 10.20.30.99 via eth0 failed, rc = 1 (host=racsvr1)
racsvr1:ora.racsvr1.vip:Interface eth0 checked failed (host=racsvr1)
racsvr1:ora.racsvr1.vip:Invalid parameters, or failed to bring up VIP (host=racsvr1)
CRS-1006: No more members to consider
CRS-0215: Could not start resource 'ora.racsvr1.LISTENER_RACSVR1.lsnr'.
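
The output above repeats the same few lines, so it can help to reduce it to which ping failed for which resource. The following is a small sketch, assuming the message format is exactly as in the output quoted above; the names `PING_FAIL` and `failed_pings` are invented for this example.

```python
import re
from collections import Counter

# Matches lines such as:
# racsvr1:ora.racsvr1.vip:ping to 10.20.30.99 via eth0 failed, rc = 1 (host=racsvr1)
PING_FAIL = re.compile(
    r"^(?P<node>[^:]+):(?P<resource>[^:]+):"
    r"ping to (?P<target>\S+) via (?P<iface>\S+) failed, rc = (?P<rc>\d+)"
)


def failed_pings(output):
    """Count failed pings per (resource, target, interface) in srvctl output."""
    counts = Counter()
    for line in output.splitlines():
        match = PING_FAIL.match(line.strip())
        if match:
            key = (match["resource"], match["target"], match["iface"])
            counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "racsvr1:ora.racsvr1.vip:ping to 10.20.30.99 via eth0 failed, rc = 1 (host=racsvr1)\n"
        "racsvr1:ora.racsvr1.vip:Interface eth0 checked failed (host=racsvr1)\n"
        "CRS-0215: Could not start resource 'ora.racsvr1.vip'.\n"
    )
    for (resource, target, iface), n in failed_pings(sample).items():
        print(f"{resource}: {n} failed ping(s) to {target} via {iface}")
```

Every failure in the quoted output points at the same target over eth0, which is the detail to chase.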

The message above is actually quite clear: ping to 10.20.30.99 via eth0 failed. Yet pinging the gateway by hand works without any problem:
Quote
[oracle@racsvr1 oracle]$ ping 10.20.30.99
PING 10.20.30.99 (10.20.30.99) 56(84) bytes of data.
64 bytes from