20220212-记一次一波三折的Oracle RAC故障处理

在这里插入图片描述

环境说明:

OS:Redhat 7.5
DB:
Oracle 11.2.0.4.0 RAC
节点1:10.1.1.103
节点2:10.1.1.101

/etc/hosts配置:

cat /etc/hosts
......
10.1.1.103    chen-cjc-01
10.1.1.101    chen-cjc-02
10.1.1.102    chen-cjc-01-vip
10.1.1.103    chen-cjc-02-vip
25.5.255.13   chen-cjc-01-pri
25.5.255.14   chen-cjc-02-pri
10.1.1.105    chen-scan

问题:

节点1,2内存使用率较高,需要进行内存扩容。

解决方案:

停机扩容内存。
1.停应用服务。
2.停库。
3.停集群。
4.停服务器扩容内存。
5.启动集群。
6.启动数据库。
7.应用修改连接数据库地址,改为连接scan ip。
8.启动应用服务。
9.验证。

实施过程中遇到的问题:

1.由于历史原因,应用没有连接数据库scan ip,只连接了节点1 vip 10.1.1.102,失去了高可用功能,需要停应用服务后进行数据库扩容。
2.Oracle ASM共享存储在NAS上,经检查发现没有设置NAS开机自动挂载,需要开机后手动挂载,需要提前将Oracle crs自动启动服务关闭,启动服务器后,先手动挂载NAS,在手动启动CRS。
3.Oracle 11.2.0.4.0 RAC安装在Redhat 7及以上版本,无法直接启动ohas服务,需要打补丁或手动添加ohas服务。
4.启动CRS时,后台日志提示无法找到voting files,导致集群无法启动,根据ocssd日志,优化NAS挂载参数。

具体问题如下:

1.停节点2数据库实例时,速度较慢,耗时9分钟。

主要原因,服务器性能较差,建议下次停实例可以考虑先提前手动中止会话,在停实例。

停止实例日志如下:

Thu Feb 10 18:42:11 2022
Shutting down instance (immediate)
Stopping background process SMCO
Shutting down instance: further logons disabled
Stopping background process QMNC
Stopping background process MMNL
Stopping background process MMON
License high water mark = 23
Thu Feb 10 18:42:38 2022
Reconfiguration started (old inc 4, new inc 6)
List of instances:
 2 (myinst: 2) 
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE 
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Thu Feb 10 18:42:38 2022
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Feb 10 18:42:38 2022
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Thu Feb 10 18:42:38 2022
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
Thu Feb 10 18:43:38 2022
Decreasing number of real time LMS from 2 to 0
Thu Feb 10 18:47:14 2022
Active call for process 5123 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 30720 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 1908 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 27349 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 1753 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 6241 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 22201 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 21133 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
SHUTDOWN: waiting for active calls to complete.
Thu Feb 10 18:52:13 2022
License high water mark = 23
USER (ospid: 23318): terminating the instance
Instance terminated by USER, pid = 23318

2.crs无法启动。

启动服务器后,系统工程师挂载NAS,DBA启动crs服务时,前台命令挂起,长时间无响应。

crsctl start crs;

检查后台集群日志、crs日志、ohas日志、ocss日志均没有任何输出。

反复中止、启动crs后,过一段时间,前台报错CRS-3124:

root@chen-cjc-02:/oracle/db/grid/product/11.2.0/bin#./crsctl start crs
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.

问题原因:
Oracle 11.2.0.4.0 RAC安装在Redhat 7及以上版本,无法直接启动ohas服务,需要打补丁或手动添加ohas服务。

处理过程:
创建服务ohas.service的服务文件并赋予权限

su - root
touch /usr/lib/systemd/system/ohas.service
chmod 777 /usr/lib/systemd/system/ohas.service

往ohas.service服务文件添加启动ohasd的相关信息

vi /usr/lib/systemd/system/ohas.service
[Unit]
Description=Oracle High Availability Services
After=syslog.target
[Service]
ExecStart=/etc/init.d/init.ohasd run >/dev/null 2>&1 Type=simple
Restart=always
[Install]
WantedBy=multi-user.target

重新加载守护进程

systemctl daemon-reload

设置守护进程自动启动

systemctl enable ohas.service

手工启动ohas服务

systemctl start ohas.service
systemctl status ohas.service

3.CRS-1714:Unable to discover any voting files

手动添加并启动ohas服务器,启动crs前台没有报错,很快返回结果。

crsctl start crs

检查后台集群日志,集群启动失败。

2022-02-10 19:37:03.771: 
[cssd(12250)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /oracle/db/grid/product/11.2.0/log/chen-cjc-02/cssd/ocssd.log
2022-02-10 19:37:18.783: 
[cssd(12250)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /oracle/db/grid/product/11.2.0/log/chen-cjc-02/cssd/ocssd.log

继续检查ocssd.log日志,提示NFS配置不正确,rsize和wsize值不正确。

2022-02-10 19:43:19.113: [   SKGFD][3605935872]running stat on disk:/oradata/vote3
2022-02-10 19:43:19.113: [   SKGFD][3605935872]WARNING:NFS file system /oradata mounted with incorrect options(rw,relatime,vers=3,rsize=16384,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=25.17.1.2,mountvers=3,mountport=2050,mountproto=tcp,local_lock=none,addr=25.17.1.2)
2022-02-10 19:43:19.113: [   SKGFD][3605935872]WARNING:Expected NFS mount options: rsize>=32768,wsize>=32768,hard,

继续检查nfs挂载参数

检查fstab里记录的nas挂载参数

cat /etc/fstab
#mount -o rw,bg,hard,nointr,nolock,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0 25.17.1.2:/CJC_db_oradata_01_nfs  /oradata

检查当前挂载参数

mount |grep oradata
25.17.1.2:/CJC_db_oradata_01_nfs on /oradata type nfs (rw,relatime,vers=3,rsize=16384,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=25.17.1.2,mountvers=3,mountport=2050,mountproto=tcp,local_lock=none,addr=25.17.1.2)

检查历史挂载命令

root@chen-cjc-01:/oracle/db/grid/product/11.2.0/log/chen-cjc-01#history |grep mount
......
 1194  2020-09-16-10:05:37 [root]umount /oradata
 1195  2020-09-16-10:05:44 [root]mount -o rw,bg,hard,nointr,nolock,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0 25.17.1.2:/CJC_db_oradata_01_nfs  /oradata

手动重新挂载

两节点分别停止crs服务

crsctl stop crs
ps -ef|grep d.bin

两节点分别重新挂载

###umount /oradata
###mount -o rw,bg,hard,nointr,nolock,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0 25.17.1.2:/CJC_db_oradata_01_nfs  /oradata

重新挂载后,crs和数据库实例可以正常启动。

验证:

检查数据库五个IP地址是否可以正常连接数据库。

###sqlplus CJC/***@10.1.1.102:15212/CJC
###sqlplus CJC/***@10.1.1.103:15212/CJC
###sqlplus CJC/***@10.1.1.105:15212/CJC
###sqlplus CJC/***@10.1.1.101:15212/CJC
###sqlplus CJC/***@10.1.1.104:15212/CJC

待应用服务修改完IP地址后,检查数据库连接情况,确保节点1,2都有业务连接。

SQL> select inst_id,username,count(*) from gv$session group by inst_id,username order by 1,2;
   INST_ID USERNAME                         COUNT(*)
---------- ------------------------------ ----------
         1 CJC                                   42
         1 MONITOR                                 3
         1 SYS                                     5
         1                                        41
         2 CJC                                   42
         2 SYS                                     3
         2                                        39
7 rows selected.

#####chenjuchao 20220212#####
欢迎关注我的公众号《IT小Chen

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值