环境说明:
DB:Oracle 11.2.0.4.0 RAC 两节点
OS:Redhat 7.6
Oracle RAC软件权限修复_小麦苗的技术博客_51CTO博客
故障说明:
突然收到如下告警
【XXXX】[2021-xx-xx xx:xx:xx]-节点[CJC系统数据库节点2],连接失败!java.sql.SQLRecoverableException: IO Error: Got minus one from a read call[XXX][XX]
问题分析:
根据Got minus one from a read call提示,根据以往经验,初步怀疑数据库连接数满了。
1 首先要确定影响范围。
检查数据库是否可以连接。
ssh登录到本地,执行sqlplus。
sqlplus / as sysdba
数据库连接登录,但是无法执行任何SQL语句,自动断开连接,看现象又不像连接数满的问题。
目前影响范围:RAC节点1可以正常使用,节点2无法对外提供服务,如果有应用单指节点2,就会对业务有影响了。
2 检查数据库告警日志
发现数据库告警日志最近几天一直没有更新过,即数据库异常后,告警日志没有任何输出信息。
3 检查RAC集群日志
tail -10000 $GRID_HOME/log/`hostname`/alert*.log|more
发现故障期间有如下错误:
2021-04-16 18:31:01.023: [/oracle/crs/bin/oraagent.bin(2961)]CRS-5013:Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log" 2021-04-16 18:32:01.021: [/oracle/crs/bin/oraagent.bin(2961)]CRS-5013:Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log"
根据提示,继续查看oraagent_grid.log日志
vi /oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log
发现了如下日志:
2021-04-16 18:33:01.030: [ora.ons][3472258816]{2:31203:2} [check] (:CLSN00008:) category: -1, operation: fail, loc: canexec2, OS error: 0, other: no exe permission, file [/oracle/crs/opmn/bin/onsctli] 2021-04-16 18:33:01.030: [ora.ons][3472258816]{2:31203:2} [check] clsnUtils::error Exception type=2 string= 可疑点如下:
no exe permission, file [/oracle/crs/opmn/bin/onsctli]
查看onsctli文件权限
[root@cjcdb02 ~]# ls -l /oracle/crs/opmn/bin/onsctli
-rwx------ 1 oracle oinstall 13760 Jan 9 2020 /oracle/crs/opmn/bin/onsctli
在查看正常节点文件权限
[root@cjcdb01 db_check]# ls -l /oracle/crs/opmn/bin/onsctli
-rwx------ 1 grid oinstall 13760 Jan 9 2020 /oracle/crs/opmn/bin/onsctli
可以看到,文件所有者由grid变成了oracle。
权限为什么突然变了呢?除了这个文件以外,还有其他文件权限也变了吗?
继续查看权限,发现ORACLE_HOME,GRID_HOME下所有文件都变了,权限都变成了oracle:oinstall
4 问题原因
为什么文件权限变了,由于不是我负责的数据库,具体原因不清楚,
估计是厂商在调试系统时误操作了,更改权限是指定了错误的目录并且接R递归修改了目录下所有文件(当前系统没有正式上线)。
5 解决方案
当前RAC第二节点ORACLE_HOME和GRID_HOME权限丢失,由于之前没有对文件权限做过备份,只能通过正常的节点1的权限对比来修复节点2文件权限。 但是文件数有80多万,手动比对和修复显然不现实,其实ORACLE已经提供了工具,可以批量生成指定目录下所有文件权限信息。 可以参考MOS:
Script to capture and restore file permission in a directory (for eg. ORACLE_HOME) (Doc ID1515018.1)
具体操作如下: (1) 下载permission.pl脚本 (2) 上传到节点1(正常节点) (3) 修改脚本权限 $ chmod 755 permission.pl (4) 捕获某目录下所有文件权限信息 ./permission.pl <Path name to capture permission> (5) 将执行脚本生成的文件上传到节点2(权限异常节点) 有如下两个脚本: a. permission-<time stamp> - This contains file permission in octal value, owner and group information of the files captured b. restore-perm-<time stamp>.cmd - This contains command to change the permission, owner, and group of the captured files (6) 根据restore-perm-<time stamp>.cmd脚本内容,修改目录位置,将节点1目录修改为节点2目录。 例如: :%s/cjcdb01/cjcdb02/g :%s/ASM1/ASM2/g (7) 节点2执行修复后的restore-perm-<time stamp>.cmd脚本 (8) 重启节点2crs及实例。
6 问题重现测试
(1)备份文件
[root@cjcdb02 ~]# cd /oracle/crs/bin/ [root@cjcdb02 bin]# ./crsctl stop crs [root@cjcdb02 ~]# tar -zcvf oracle_20210416bak.tar.gz /oracle
(2)修改权限,模拟误操作
[root@cjcdb02 ~]# chown oracle.oinstall /oracle -R
(3)启动数据库
如果是在数据库启动情况下,修改的权限,数据库无法正常使用,无法通过shutdown immediate进行关闭,
可以使用shutdown abort方式关闭。
[oracle@cjcdb02 ~]$ sqlplus / as sysdba SQL*Plus: Release 11.2.0.4.0 Production on Fri Apr 16 18:31:06 2021 Copyright (c) 1982, 2013, Oracle. All rights reserved. Connected to an idle instance. SQL> startup ORA-01078: failure in processing system parameters ORA-01565: error in identifying file '+DATA/kxx/spfilekxx.ora' ORA-17503: ksfdopn:10 Failed to open file +DATA/kxx/spfilekxx.ora ORA-01034: ORACLE not available ORA-27123: unable to attach to shared memory segment Linux-x86_64 Error: 13: Permission denied Additional information: 26 Additional information: 1310725
报错:Permission denied
也无法关闭crs
[root@cjcdb02 bin]# ./crsctl stop crs CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'cjcdb02' CRS-2673: Attempting to stop 'ora.crsd' on 'cjcdb02' CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'cjcdb02' CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'cjcdb02' CRS-2673: Attempting to stop 'ora.ARCH.dg' on 'cjcdb02' CRS-2673: Attempting to stop 'ora.OCR.dg' on 'cjcdb02' CRS-2673: Attempting to stop 'ora.DATA.dg' on 'cjcdb02' CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'cjcdb02' succeeded CRS-2673: Attempting to stop 'ora.cjcdb02.vip' on 'cjcdb02' CRS-2677: Stop of 'ora.ARCH.dg' on 'cjcdb02' succeeded CRS-2677: Stop of 'ora.DATA.dg' on 'cjcdb02' succeeded CRS-2677: Stop of 'ora.cjcdb02.vip' on 'cjcdb02' succeeded CRS-2672: Attempting to start 'ora.cjcdb02.vip' on 'cjcdb01' CRS-2676: Start of 'ora.cjcdb02.vip' on 'cjcdb01' succeeded CRS-2677: Stop of 'ora.OCR.dg' on 'cjcdb02' succeeded CRS-2673: Attempting to stop 'ora.asm' on 'cjcdb02' CRS-2677: Stop of 'ora.asm' on 'cjcdb02' succeeded CRS-2673: Attempting to stop 'ora.ons' on 'cjcdb02' CRS-5013: Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "stop": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log" CRS-5013: Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log" CRS-2675: Stop of 'ora.ons' on 'cjcdb02' failed CRS-2679: Attempting to clean 'ora.ons' on 'cjcdb02' CRS-5013: Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "clean": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log" CRS-5013: Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log" CRS-2678: 'ora.ons' on 'cjcdb02' has experienced an unrecoverable failure CRS-0267: Human intervention required to resume its availability. CRS-2799: Failed to shut down resource 'ora.net1.network' on 'cjcdb02' CRS-2799: Failed to shut down resource 'ora.ons' on 'cjcdb02' CRS-2799: Failed to shut down resource 'ora.cjcdb02.ons' on 'cjcdb02' CRS-2794: Shutdown of Cluster Ready Services-managed resources on 'cjcdb02' has failed CRS-5022: Stop of resource "ora.crsd" failed: current state is "UNKNOWN" CRS-2675: Stop of 'ora.crsd' on 'cjcdb02' failed CRS-2799: Failed to shut down resource 'ora.crsd' on 'cjcdb02' CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'cjcdb02' has failed CRS-4687: Shutdown command has completed with errors. CRS-4000: Command Stop failed, or completed with errors.
[root@cjcdb02 bin]# ./crsctl stop crs -f CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'cjcdb02' CRS-2673: Attempting to stop 'ora.crsd' on 'cjcdb02' CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'cjcdb02' CRS-2673: Attempting to stop 'ora.net1.network' on 'cjcdb02' CRS-2679: Attempting to clean 'ora.ons' on 'cjcdb02' CRS-2677: Stop of 'ora.net1.network' on 'cjcdb02' succeeded CRS-5013: Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "clean": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log" CRS-5013: Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log" CRS-2680: Clean of 'ora.ons' on 'cjcdb02' failed CRS-2799: Failed to shut down resource 'ora.ons' on 'cjcdb02' CRS-2799: Failed to shut down resource 'ora.cjcdb02.ons' on 'cjcdb02' CRS-2794: Shutdown of Cluster Ready Services-managed resources on 'cjcdb02' has failed CRS-5022: Stop of resource "ora.crsd" failed: current state is "UNKNOWN" CRS-2675: Stop of 'ora.crsd' on 'cjcdb02' failed CRS-2799: Failed to shut down resource 'ora.crsd' on 'cjcdb02' CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'cjcdb02' has failed CRS-4687: Shutdown command has completed with errors. CRS-4000: Command Stop failed, or completed with errors.
[root@cjcdb02 bin]# ps -ef|grep d.bin root 1041 1 0 18:25 ? 00:00:02 /oracle/crs/bin/ohasd.bin reboot root 1140 1 0 Apr06 ? 00:00:00 /opt/ant-agent/agent/embedded/bin/python -m upgrade root 1370 1 0 Apr06 ? 00:00:49 /opt/ant-agent/agent/embedded/bin/python /opt/ant-agent/agent/bin/circled --config /opt/ant-agent/agent/proc --log-output /opt/ant-agent/agent/logs/circle.log --daemon root 1375 1370 0 Apr06 ? 00:16:36 /opt/ant-agent/agent/embedded/bin/python -m framework root 1380 1370 0 Apr06 ? 00:00:00 /opt/ant-agent/agent/embedded/bin/python ./bin/salt-minion -c ./conf root 1381 1370 0 Apr06 ? 00:48:40 /opt/ant-agent/agent/embedded/bin/python ./agent/collect.py --use-local-forwarder grid 1384 1 0 18:25 ? 00:00:00 /oracle/crs/bin/mdnsd.bin root 1392 1370 0 Apr06 ? 00:12:53 /opt/ant-agent/agent/embedded/bin/python ./agent/forwarder.py grid 1404 1 0 18:25 ? 00:00:00 /oracle/crs/bin/gpnpd.bin grid 1421 1 0 18:25 ? 00:00:01 /oracle/crs/bin/gipcd.bin root 1435 1 1 18:25 ? 00:00:06 /oracle/crs/bin/osysmond.bin grid 1476 1 0 18:25 ? 00:00:02 /oracle/crs/bin/ocssd.bin root 1512 1380 0 Apr06 ? 00:21:00 /opt/ant-agent/agent/embedded/bin/python ./bin/salt-minion -c ./conf root 1514 1512 0 Apr06 ? 00:00:00 /opt/ant-agent/agent/embedded/bin/python ./bin/salt-minion -c ./conf root 2191 1 0 18:26 ? 00:00:00 /oracle/crs/bin/octssd.bin reboot grid 2342 1 0 18:26 ? 00:00:00 /oracle/crs/bin/evmd.bin root 2732 1 0 18:26 ? 00:00:02 /oracle/crs/bin/crsd.bin reboot root 8108 30161 0 18:35 pts/0 00:00:00 grep --color=auto d.bin
(4)查看集群日志可以看到权限有问题
集群日志
2021-04-16 18:31:01.023: [/oracle/crs/bin/oraagent.bin(2961)]CRS-5013:Agent "/oracle/crs/bin/oraagent.bin" failed to start process "/oracle/crs/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log" [root@cjcdb02 ~]# vi /oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log 2021-04-16 18:33:01.030: [ora.ons][3472258816]{2:31203:2} [check] (:CLSN00008:) category: -1, operation: fail, loc: canexec2, OS error: 0, other: no exe permission, file [/oracle/crs/opmn/bin/onsctli] 2021-04-16 18:33:01.030: [ora.ons][3472258816]{2:31203:2} [check] clsnUtils::error Exception type=2 string= [root@cjcdb02 ~]# ls -l /oracle/crs/opmn/bin/onsctli -rwx------ 1 oracle oinstall 13760 Jan 9 2020 /oracle/crs/opmn/bin/onsctli
(5)修复问题
节点1 [root@cjcdb01 db_check]# pwd /db_check [root@cjcdb01 db_check]# ls -lrth total 4.0K -rw-r--r-- 1 root root 2.4K Apr 16 12:31 permission.pl [root@cjcdb01 db_check]# chmod 755 permission.pl [root@cjcdb01 db_check]# ./permission.pl /oracle Following log files are generated logfile : permission-Fri-Apr-16-12-34-00-2021 Command file : restore-perm-Fri-Apr-16-12-34-00-2021.cmd Linecount : 438065 [root@cjcdb01 db_check]# ll -rth total 113M -rwxr-xr-x 1 root root 2.4K Apr 16 12:31 permission.pl -rw-r--r-- 1 root root 75M Apr 16 12:36 restore-perm-Fri-Apr-16-12-34-00-2021.cmd -rw-r--r-- 1 root root 39M Apr 16 12:36 permission-Fri-Apr-16-12-34-00-2021
将生成的文件拷贝到节点2
修改部分内容后执行
[root@cjcdb02 db_check]# vi restore-perm-Fri-Apr-16-12-34-00-2021.cmd :%s/cjcdb01/cjcdb02/g [root@cjcdb02 db_check]# chmod 755 restore-perm-Fri-Apr-16-12-34-00-2021.cmd [root@cjcdb02 db_check]# ./restore-perm-Fri-Apr-16-12-34-00-2021.cmd
执行脚本修复权限过程中,查看文件权限逐渐恢复,告警日志
[root@cjcdb02 db_check]# tail -f /oracle/crs/log/cjcdb02/agent/crsd/oraagent_grid//oraagent_grid.log 2021-04-16 18:47:23.633: [ AGFW][3470157568]{2:31203:2} Agent received the message: AGENT_HB[Engine] ID 12293:542 2021-04-16 18:47:53.633: [ AGFW][3470157568]{2:31203:2} Agent received the message: AGENT_HB[Engine] ID 12293:546 2021-04-16 18:48:01.024: [ora.ons][3463853824]{2:31203:2} [check] Utils::getOracleHomeAttrib getEnvVar oracle_home:/oracle/crs 2021-04-16 18:48:01.024: [ora.ons][3463853824]{2:31203:2} [check] Utils::getOracleHomeAttrib oracle_home:/oracle/crs 2021-04-16 18:48:01.024: [ora.ons][3463853824]{2:31203:2} [check] Utils:execCmd action = 3 flags = 6 ohome = /oracle/crs/opmn/ cmdname = onsctli. 2021-04-16 18:48:01.127: [ora.ons][3463853824]{2:31203:2} [check] (:CLSN00010:)ons is running ... 2021-04-16 18:48:01.127: [ora.ons][3463853824]{2:31203:2} [check] (:CLSN00010:) 2021-04-16 18:48:01.127: [ora.ons][3463853824]{2:31203:2} [check] execCmd ret = 0 2021-04-16 18:48:23.634: [ AGFW][3470157568]{2:31203:2} Agent received the message: AGENT_HB[Engine] ID 12293:550 2021-04-16 18:48:31.024: [CLSFRAME][3590973248] TM [MultiThread] is changing desired thread # to 2. Current # is 3 2021-04-16 18:48:31.024: [CLSFRAME][3472258816]{0:1:26} Worker thread is exiting in TM [MultiThread] to meet the desired count of 2. New count is 2 2021-04-16 18:48:53.634: [ AGFW][3470157568]{2:31203:2} Agent received the message: AGENT_HB[Engine] ID 12293:554 2021-04-16 18:49:01.025: [ora.ons][3463853824]{2:31203:2} [check] Utils::getOracleHomeAttrib getEnvVar oracle_home:/oracle/crs 2021-04-16 18:49:01.025: [ora.ons][3463853824]{2:31203:2} [check] Utils::getOracleHomeAttrib oracle_home:/oracle/crs 2021-04-16 18:49:01.025: [ora.ons][3463853824]{2:31203:2} [check] Utils:execCmd action = 3 flags = 6 ohome = /oracle/crs/opmn/ cmdname = onsctli. 2021-04-16 18:49:01.127: [ora.ons][3463853824]{2:31203:2} [check] (:CLSN00010:)ons is running ... 2021-04-16 18:49:01.127: [ora.ons][3463853824]{2:31203:2} [check] (:CLSN00010:) 2021-04-16 18:49:01.127: [ora.ons][3463853824]{2:31203:2} [check] execCmd ret = 0 2021-04-16 18:49:01.127: [CLSFRAME][3590973248] TM [MultiThread] is changing desired thread # to 3. Current # is 2 2021-04-16 18:49:23.635: [ AGFW][3470157568]{2:31203:2} Agent received the message: AGENT_HB[Engine] ID 12293:558 [root@cjcdb02 ~]# ls -l /oracle/crs/opmn/bin/onsctli -rwx------ 1 grid oinstall 13760 Jan 9 2020 /oracle/crs/opmn/bin/onsctli
执行完脚本后,可以正常重启crs和db实例
[root@cjcdb02 bin]# ./crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
crs启动日志如下:
2021-04-16 18:52:36.699: [ohasd(3331)]CRS-2112:The OLR service started on node cjcdb02. 2021-04-16 18:52:36.704: [ohasd(3331)]CRS-1301:Oracle High Availability Service started on node cjcdb02. 2021-04-16 18:52:36.704: [ohasd(3331)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred 2021-04-16 18:52:39.979: [/oracle/crs/bin/orarootagent.bin(3423)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running). 2021-04-16 18:52:44.379: [gpnpd(3566)]CRS-2328:GPNPD started on node cjcdb02. 2021-04-16 18:52:46.727: [cssd(3640)]CRS-1713:CSSD daemon is started in clustered mode 2021-04-16 18:52:48.562: [ohasd(3331)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE 2021-04-16 18:53:10.861: [cssd(3640)]CRS-1707:Lease acquisition for node cjcdb02 number 2 completed 2021-04-16 18:53:12.132: [cssd(3640)]CRS-1605:CSSD voting file is online: /dev/asm-ocr1; details in /oracle/crs/log/cjcdb02/cssd/ocssd.log. 2021-04-16 18:53:12.141: [cssd(3640)]CRS-1605:CSSD voting file is online: /dev/asm-ocr3; details in /oracle/crs/log/cjcdb02/cssd/ocssd.log. 2021-04-16 18:53:12.153: [cssd(3640)]CRS-1605:CSSD voting file is online: /dev/asm-ocr2; details in /oracle/crs/log/cjcdb02/cssd/ocssd.log. 2021-04-16 18:53:17.591: [cssd(3640)]CRS-1601:CSSD Reconfiguration complete. Active nodes are cjcdb01 cjcdb02 . 2021-04-16 18:53:19.987: [ctssd(4180)]CRS-2407:The new Cluster Time Synchronization Service reference node is host cjcdb01. 2021-04-16 18:53:19.987: [ctssd(4180)]CRS-2401:The Cluster Time Synchronization Service started on host cjcdb02. 2021-04-16 18:53:21.724: [ohasd(3331)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE 2021-04-16 18:53:28.016: [ctssd(4180)]CRS-2403:The Cluster Time Synchronization Service on host cjcdb02 is in observer mode. 2021-04-16 18:53:28.989: [ctssd(4180)]CRS-2409:The clock on host cjcdb02 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode. 2021-04-16 18:53:42.006: [crsd(4517)]CRS-1012:The OCR service started on node cjcdb02. 2021-04-16 18:53:42.017: [evmd(4254)]CRS-1401:EVMD started on node cjcdb02. 2021-04-16 18:53:43.061: [crsd(4517)]CRS-1201:CRSD started on node cjcdb02.
实例也可以正常启动
[root@cjcdb02 bin]# su - oracle Last login: Fri Apr 16 18:30:56 CST 2021 on pts/0 [oracle@cjcdb02 ~]$ sqlplus / as sysdba SQL*Plus: Release 11.2.0.4.0 Production on Fri Apr 16 18:54:22 2021 Copyright (c) 1982, 2013, Oracle. All rights reserved. Connected to an idle instance. SQL> startup ORACLE instance started. Total System Global Area 3724607488 bytes Fixed Size 2258760 bytes Variable Size 771754168 bytes Database Buffers 2936012800 bytes Redo Buffers 14581760 bytes Database mounted. Database opened. SQL> select count(*),inst_id from gv$session group by inst_id; COUNT(*) INST_ID ---------- ---------- 42 1 42 2
注意:
虽然本次案例节点2可以正常恢复,但是在实际生产环境中,同一套RAC下不同节点的文件及文件权限可能千差万别,生产操作一定要有备份,能回退。