环境说明:
OS:Redhat 7.5
DB:
Oracle 11.2.0.4.0 RAC
节点1:10.1.1.103
节点2:10.1.1.101
/etc/hosts配置:
cat /etc/hosts
......
10.1.1.103 chen-cjc-01
10.1.1.101 chen-cjc-02
10.1.1.102 chen-cjc-01-vip
10.1.1.103 chen-cjc-02-vip
25.5.255.13 chen-cjc-01-pri
25.5.255.14 chen-cjc-02-pri
10.1.1.105 chen-scan
问题:
节点1,2内存使用率较高,需要进行内存扩容。
解决方案:
停机扩容内存。
1.停应用服务。
2.停库。
3.停集群。
4.停服务器扩容内存。
5.启动集群。
6.启动数据库。
7.应用修改连接数据库地址,改为连接scan ip。
8.启动应用服务。
9.验证。
实施过程中遇到的问题:
1.由于历史原因,应用没有连接数据库scan ip,只连接了节点1 vip 10.1.1.102,失去了高可用功能,需要停应用服务后进行数据库扩容。
2.Oracle ASM共享存储在NAS上,经检查发现没有设置NAS开机自动挂载,需要开机后手动挂载,需要提前将Oracle crs自动启动服务关闭,启动服务器后,先手动挂载NAS,在手动启动CRS。
3.Oracle 11.2.0.4.0 RAC安装在Redhat 7及以上版本,无法直接启动ohas服务,需要打补丁或手动添加ohas服务。
4.启动CRS时,后台日志提示无法找到voting files,导致集群无法启动,根据ocssd日志,优化NAS挂载参数。
具体问题如下:
1.停节点2数据库实例时,速度较慢,耗时9分钟。
主要原因,服务器性能较差,建议下次停实例可以考虑先提前手动中止会话,在停实例。
停止实例日志如下:
Thu Feb 10 18:42:11 2022
Shutting down instance (immediate)
Stopping background process SMCO
Shutting down instance: further logons disabled
Stopping background process QMNC
Stopping background process MMNL
Stopping background process MMON
License high water mark = 23
Thu Feb 10 18:42:38 2022
Reconfiguration started (old inc 4, new inc 6)
List of instances:
2 (myinst: 2)
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Thu Feb 10 18:42:38 2022
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Feb 10 18:42:38 2022
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Thu Feb 10 18:42:38 2022
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Thu Feb 10 18:43:38 2022
Decreasing number of real time LMS from 2 to 0
Thu Feb 10 18:47:14 2022
Active call for process 5123 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 30720 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 1908 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 27349 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 1753 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 6241 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 22201 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 21133 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
SHUTDOWN: waiting for active calls to complete.
Thu Feb 10 18:52:13 2022
License high water mark = 23
USER (ospid: 23318): terminating the instance
Instance terminated by USER, pid = 23318
2.crs无法启动。
启动服务器后,系统工程师挂载NAS,DBA启动crs服务时,前台命令挂起,长时间无响应。
crsctl start crs;
检查后台集群日志、crs日志、ohas日志、ocss日志均没有任何输出。
反复中止、启动crs后,过一段时间,前台报错CRS-3124:
root@chen-cjc-02:/oracle/db/grid/product/11.2.0/bin#./crsctl start crs
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
问题原因:
Oracle 11.2.0.4.0 RAC安装在Redhat 7及以上版本,无法直接启动ohas服务,需要打补丁或手动添加ohas服务。
处理过程:
创建服务ohas.service的服务文件并赋予权限
su - root
touch /usr/lib/systemd/system/ohas.service
chmod 777 /usr/lib/systemd/system/ohas.service
往ohas.service服务文件添加启动ohasd的相关信息
vi /usr/lib/systemd/system/ohas.service
[Unit]
Description=Oracle High Availability Services
After=syslog.target
[Service]
ExecStart=/etc/init.d/init.ohasd run >/dev/null 2>&1 Type=simple
Restart=always
[Install]
WantedBy=multi-user.target
重新加载守护进程
systemctl daemon-reload
设置守护进程自动启动
systemctl enable ohas.service
手工启动ohas服务
systemctl start ohas.service
systemctl status ohas.service
3.CRS-1714:Unable to discover any voting files
手动添加并启动ohas服务器,启动crs前台没有报错,很快返回结果。
crsctl start crs
检查后台集群日志,集群启动失败。
2022-02-10 19:37:03.771:
[cssd(12250)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /oracle/db/grid/product/11.2.0/log/chen-cjc-02/cssd/ocssd.log
2022-02-10 19:37:18.783:
[cssd(12250)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /oracle/db/grid/product/11.2.0/log/chen-cjc-02/cssd/ocssd.log
继续检查ocssd.log日志,提示NFS配置不正确,rsize和wsize值不正确。
2022-02-10 19:43:19.113: [ SKGFD][3605935872]running stat on disk:/oradata/vote3
2022-02-10 19:43:19.113: [ SKGFD][3605935872]WARNING:NFS file system /oradata mounted with incorrect options(rw,relatime,vers=3,rsize=16384,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=25.17.1.2,mountvers=3,mountport=2050,mountproto=tcp,local_lock=none,addr=25.17.1.2)
2022-02-10 19:43:19.113: [ SKGFD][3605935872]WARNING:Expected NFS mount options: rsize>=32768,wsize>=32768,hard,
继续检查nfs挂载参数
检查fstab里记录的nas挂载参数
cat /etc/fstab
#mount -o rw,bg,hard,nointr,nolock,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0 25.17.1.2:/CJC_db_oradata_01_nfs /oradata
检查当前挂载参数
mount |grep oradata
25.17.1.2:/CJC_db_oradata_01_nfs on /oradata type nfs (rw,relatime,vers=3,rsize=16384,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=25.17.1.2,mountvers=3,mountport=2050,mountproto=tcp,local_lock=none,addr=25.17.1.2)
检查历史挂载命令
root@chen-cjc-01:/oracle/db/grid/product/11.2.0/log/chen-cjc-01#history |grep mount
......
1194 2020-09-16-10:05:37 [root]umount /oradata
1195 2020-09-16-10:05:44 [root]mount -o rw,bg,hard,nointr,nolock,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0 25.17.1.2:/CJC_db_oradata_01_nfs /oradata
手动重新挂载
两节点分别停止crs服务
crsctl stop crs
ps -ef|grep d.bin
两节点分别重新挂载
###umount /oradata
###mount -o rw,bg,hard,nointr,nolock,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0 25.17.1.2:/CJC_db_oradata_01_nfs /oradata
重新挂载后,crs和数据库实例可以正常启动。
验证:
检查数据库五个IP地址是否可以正常连接数据库。
###sqlplus CJC/***@10.1.1.102:15212/CJC
###sqlplus CJC/***@10.1.1.103:15212/CJC
###sqlplus CJC/***@10.1.1.105:15212/CJC
###sqlplus CJC/***@10.1.1.101:15212/CJC
###sqlplus CJC/***@10.1.1.104:15212/CJC
待应用服务修改完IP地址后,检查数据库连接情况,确保节点1,2都有业务连接。
SQL> select inst_id,username,count(*) from gv$session group by inst_id,username order by 1,2;
INST_ID USERNAME COUNT(*)
---------- ------------------------------ ----------
1 CJC 42
1 MONITOR 3
1 SYS 5
1 41
2 CJC 42
2 SYS 3
2 39
7 rows selected.
#####chenjuchao 20220212#####
欢迎关注我的公众号《IT小Chen》