Date: 2014-05-16

Recovering a TB-Scale ERP Database (ASM RAC)

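Not long ago a customer's ERP database failed (Linux x64, Oracle 10.2.0.4 RAC on ASM). In short, a series of operations left the disk group unmountable,
so data recovery was the only option. We put eight people on this case and worked three days and three nights before we finally rescued the database.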

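It began when the customer ran add disk on one RAC node and found that the disks had not been added on the other node. They then retried the add several times, and even overwrote the disk headers with dd before adding again.
The most damaging step was a forced add disk. These disks had in fact already been added once before that step, and the rebalance had completed, but the subsequent drop disk
had not succeeded. In the end the customer tried to add the disks by force, as follows: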

SQL> ALTER DISKGROUP xxxx ADD  DISK 'ORCL:VOL1_xxx' SIZE 2097152M  FORCE ,
'ORCL:VOL2_xxx' SIZE 2097152M  FORCE ,
'ORCL:VOL3_xxx' SIZE 2097152M  FORCE
........
ORA-15186: ASMLIB error function = [asm_open],  error = [1],  mesg = [Operation not permitted]
Tue Feb 18 06:09:32 2014
SQL> alter diskgroup xxx MOUNT
NOTE: cache registered group xxx number=1 incarn=0x6c42d680
.......
Tue Feb 18 06:09:32 2014
NOTE: Hbeat: instance not first (grp 1)
Tue Feb 18 06:09:32 2014
NOTE: cache dismounting group 1/0x6C42D680 (xxx)
NOTE: dbwr not being msg'd to dismount
Tue Feb 18 06:09:32 2014
NOTE: PST enabling heartbeating (grp 1)
Tue Feb 18 06:09:32 2014
ERROR: diskgroup xxx was not mounted
Tue Feb 18 06:10:22 2014
ORA-15186: ASMLIB error function = [asm_open],  error = [1],  mesg = [Operation not permitted]
Tue Feb 18 06:10:22 2014
.........
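
An excerpt like the one above interleaves timestamp lines, NOTE lines and ORA- errors, which makes a long alert log slow to read by eye. The following is a minimal sketch, not part of the original recovery, that pairs each ORA- code with the most recent timestamp line before it. The names `TS_RE`, `ORA_RE` and `ora_errors` are hypothetical, and the timestamp pattern assumes the "Tue Feb 18 06:09:32 2014" layout seen in this log.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Assumed timestamp layout, as in the excerpt: "Tue Feb 18 06:09:32 2014".
TS_RE = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}$")
# Oracle error codes such as "ORA-15186".
ORA_RE = re.compile(r"\bORA-\d{5}\b")


def ora_errors(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], str, str]]:
    """Yield (timestamp, code, line) for every ORA- code found.

    The timestamp is the most recent timestamp line seen so far,
    or None if the error appears before any timestamp line.
    """
    current_ts: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if TS_RE.match(line):
            current_ts = line
            continue
        for code in ORA_RE.findall(line):
            yield current_ts, code, line


if __name__ == "__main__":
    import sys

    # Usage: python ora_errors.py < alert_log_file
    for ts, code, line in ora_errors(sys.stdin):
        print(f"{ts or '(no timestamp)'} | {code} | {line}")
```

Because the excerpt is truncated, an error that appears before the first timestamp line is reported without a timestamp instead of being assigned a guessed one.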
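As a result the disk group could no longer be mounted at all, so the database naturally could not be opened either, and it reported errors like the following: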
Tue Feb 18 05:53:57 2014
Errors in file /opt/oracle/admin/xxx/bdump/xxx_lmon_17095.trc:
ORA-00202: control file: '+xxx/xxx/controlfile/current.256.743166671'
ORA-15078: ASM diskgroup was forcibly dismounted
Tue Feb 18 05:53:58 2014