Problem Background
For some time, a database kept raising a few connection-failure alerts around the top of each hour, for example:
2019-08-14 10:00:22, 29.211.?.?/??db can not connect, Warning from 5.223-mon_pri_conn.sh!
Analysis
1. The listener log shows that the listener restarted
14-AUG-2019 10:00:16 * (CONNECT_DATA=(SERVICE_NAME=dbinst1)(CID=(PROGRAM=sqlplus)(HOST=localhost.bss_rh6_dqjkj01_35.176)(USER=ftpuser))) * (ADDRESS=(PROTOCOL=tcp)(HOST=29.211.35.176)(PORT=19573)) * establish * dbinst1 * 0
2019-08-14T10:00:17.907210+08:00
14-AUG-2019 10:00:17 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVICE_NAME=dbinst1)) * (ADDRESS=(PROTOCOL=tcp)(HOST=29.211.177.93)(PORT=63570)) * establish * dbinst1 * 12518
TNS-12518: TNS:listener could not hand off client connection   <--- the listener starts failing at 10:00:17
TNS-12540: TNS:internal limit restriction exceeded
14-AUG-2019 10:00:17 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVICE_NAME=dbinst1)) * (ADDRESS=(PROTOCOL=tcp)(HOST=29.211.31.19)(PORT=43753)) * establish * dbinst1 * 12518
TNS-12518: TNS:listener could not hand off client connection
TNS-12540: TNS:internal limit restriction exceeded
14-AUG-2019 10:00:17 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVICE_NAME=dbinst1)) * (ADDRESS=(PROTOCOL=tcp)(HOST=29.211.35.160)(PORT=15812)) * establish * dbinst1 * 12518
TNS-12518: TNS:listener could not hand off client connection
TNS-12540: TNS:internal limit restriction exceeded
....
14-AUG-2019 10:00:20 * 12582
TNS-12582: TNS:invalid operation
TNS-12615: TNS:preempt error
Error listening on: (ADDRESS=(PROTOCOL=ipc)(PARTIAL=yes)(QUEUESIZE=1))
14-AUG-2019 10:00:21 * establish * 1159
TNS-01159: Internal connection limit has been reached; listener has shut down   <--- the listener shuts down at 10:00:21 and is restarted automatically
TNS-12540: TNS:internal limit restriction exceeded
No longer listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=29.211.16.112)(PORT=1521)))
No longer listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
No longer listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=29.211.16.131)(PORT=1521)))
2019-08-14T10:00:46.748331+08:00
System parameter file is /u01/app/12.2.0/grid/network/admin/listener.ora

Short connections per second just before the failure:
 11 14-AUG-2019 09:59:55
  1 14-AUG-2019 09:59:56
  1 14-AUG-2019 09:59:57
  1 14-AUG-2019 09:59:59
 18 14-AUG-2019 10:00:00
 16 14-AUG-2019 10:00:01
 13 14-AUG-2019 10:00:02
  4 14-AUG-2019 10:00:03
  6 14-AUG-2019 10:00:04
 10 14-AUG-2019 10:00:05   <--- a large burst of short connections before the listener restart
 53 14-AUG-2019 10:00:06
 35 14-AUG-2019 10:00:07
 47 14-AUG-2019 10:00:08
 65 14-AUG-2019 10:00:09
105 14-AUG-2019 10:00:10
 61 14-AUG-2019 10:00:11
 31 14-AUG-2019 10:00:12
 38 14-AUG-2019 10:00:13
  5 14-AUG-2019 10:00:15
  3 14-AUG-2019 10:00:16

Throughout the incident window the database wait events were normal, with no obviously heavy resource consumption.
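For reference, the per-second counts above can be reproduced directly from listener.log with a one-liner along these lines; this is a rough sketch, and the log path is illustrative since it depends on the local ADR layout.

# Count inbound "establish" requests per second around the incident window.
grep ' establish ' /u01/app/grid/diag/tnslsnr/dbserver1/listener/trace/listener.log \
  | grep '14-AUG-2019 10:0' \
  | awk '{print $1, $2}' \
  | sort | uniq -c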
2. Check the OS load during the incident window
OSWBB_ARCHIVE_DEST /u01/osw/oswbb/archive

zzz ***Wed Aug 14 10:00:10 CST 2019
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd      free   buff   cache   si   so    bi    bo     in     cs us sy id wa st
31  6      0 446655712  33212 1998720    0    0     1     3      0      0  1  1 98  0  0
21 22      0 446779968  35236 2008236    0    0 16300    96 259964 240440 11  3 83  3  0
25 27      0 446795360  37040 2020368    0    0 29096    48 235608 212541 11  3 82  4  0

zzz ***Wed Aug 14 10:00:24 CST 2019
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd      free   buff   cache   si   so    bi    bo     in     cs us sy id wa st
 6 18      0 443893696  72420 2204804    0    0     1     3      0      0  1  1 98  0  0
13 12      0 444055840  74432 2228936    0    0 22880    16  90258  85804  5  2 88  6  0
 6 17      0 444172864  76036 2250128    0    0 19980   100  70041  61176  4  3 86  7  0

zzz ***Wed Aug 14 10:00:36 CST 2019
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
  r   b   swpd      free   buff   cache   si   so   bi    bo     in     cs us sy id wa st
  6  20      0 444360928  83052 2341764    0    0    1     3      0      0  1  1 98  0  0
  6  24      0 444552928  83740 2098432    0    0 4436  1212  67242  61741  4  3 85  8  0
125 220      0 444898240  84812 2102292    0    0 3892  1544 174504 140362 12 13 55 19  0

1. In the 10:00:10 sample, r=21 and b=22. This machine sits on SSD storage and the b column is normally below 5, so these values are already somewhat abnormal.
2. In the 10:00:36 sample, r=125 and b=220 appear. This may be backlog from the listener restart, and since that point in time is after the restart, it cannot be confirmed as the cause of the failure.
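If oswbb collects into its default archive layout, the vmstat samples for the incident window can be pulled out as follows; this is a sketch, and the file name pattern is an assumption about the local oswbb configuration.

# Print only the vmstat samples taken between 10:00 and 10:01 on Aug 14.
awk '/^zzz .*Aug 14 10:0[01]/{show=1;print;next} /^zzz/{show=0;next} show' \
  /u01/osw/oswbb/archive/oswvmstat/dbserver1_vmstat_19.08.14.1000.dat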
top - 10:00:36 up 288 days, 14:24,  2 users,  load average: 215.59, 68.64, 29.25
Tasks: 8823 total,  15 running, 8808 sleeping,  0 stopped,  0 zombie
Cpu(s):  8.4%us,  2.2%sy,  0.0%ni, 88.6%id,  0.6%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:  1058551988k total, 611291252k used, 447260736k free,    87332k buffers
Swap:  20971516k total,        0k used,  20971516k free,  2151908k cached

1. The 1-minute load average reached 215, a near-vertical climb relative to the 5-minute and 15-minute values.
2. On Linux the load average reflects overall system load, including CPU load, IO load, and so on; it can be roughly read as the number of processes in the R and D states. It is not CPU utilization, but a high load always means heavy load of some kind.
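Since the load average roughly tracks processes in the R and D states, the quickest way to see which side is driving a spike is to count process states directly, either live or against a captured snapshot. A sketch; /tmp/ps1024.txt is the snapshot analyzed below, with the process state in its 10th column:

# Live: count processes by state; an IO-driven spike shows up as many 'D' entries.
ps -eo state= | sort | uniq -c | sort -rn

# Same idea against the captured ps snapshot.
awk '{print $10}' /tmp/ps1024.txt | sort | uniq -c | sort -rn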
Next, focus on the ps data captured during the incident. Excluding processes in the S (sleeping) state, a large number of IO-related D-state processes can be seen:

[oracle@dbserver1 tmp]$ cat /tmp/ps1024.txt | awk '{print $1,$9,$10,$13,$14}' | sort | uniq -c
    454 oracle filp_o D oracledbinst11 (LOCAL=NO)   <--- filp_o (filp_open) is the Linux kernel function for opening files
      3 oracle get_wr D oracledbinst11 (LOCAL=NO)
     35 oracle lookup D oracledbinst11 (LOCAL=NO)
      1 oracle lookup D ora_m001_dbinst11
      1 oracle - R ps -aeo
     16 oracle sync_b D oracledbinst11 (LOCAL=NO)
      1 root ? R /bin/netstat -ap
      2 root ? R /bin/ps -p
[oracle@dbserver1 tmp]$
[oracle@dbserver1 tmp]$ cat /tmp/ps1024.txt | grep '10:00' | awk '{print $1,$9,$10,$11,$13,$14}' | sort | uniq -c
    112 oracle filp_o D 10:00:14 oracledbinst11 (LOCAL=NO)   <--- 112 short connections created at 10:00:14
    115 oracle filp_o D 10:00:15 oracledbinst11 (LOCAL=NO)   <--- 115 short connections created at 10:00:15; the backlog of short connections is stuck in filp_o, so IO may be abnormal
     75 oracle filp_o D 10:00:16 oracledbinst11 (LOCAL=NO)
      9 oracle filp_o D 10:00:17 oracledbinst11 (LOCAL=NO)
      1 oracle filp_o D 10:00:36 oracledbinst11 (LOCAL=NO)
      6 oracle lookup D 10:00:12 oracledbinst11 (LOCAL=NO)
     38 oracle lookup D 10:00:13 oracledbinst11 (LOCAL=NO)
     33 oracle lookup D 10:00:14 oracledbinst11 (LOCAL=NO)
     33 oracle lookup D 10:00:15 oracledbinst11 (LOCAL=NO)
     47 oracle lookup D 10:00:16 oracledbinst11 (LOCAL=NO)
     17 oracle lookup D 10:00:17 oracledbinst11 (LOCAL=NO)
      5 oracle lookup D 10:00:18 oracledbinst11 (LOCAL=NO)

ASH data re-confirms that database load was low and IO response times were at the millisecond level, with nothing obviously abnormal.

Tracing a test connection with strace shows that after the listener spawns a LOCAL=NO process, a large number of local library files under ORACLE_HOME are read, so the state of the /u01 filesystem was examined further:

[oracle@dbserver1 tmp]$ cat /tmp/stsqlplus.txt | grep '12.2.0/db'
126319 open("/u01/app/oracle/product/12.2.0/db/lib/libsqlplus.so", O_RDONLY) = 3
126319 open("/u01/app/oracle/product/12.2.0/db/lib/libclntsh.so.12.1", O_RDONLY) = 3
126319 open("/u01/app/oracle/product/12.2.0/db/lib/libclntshcore.so.12.1", O_RDONLY) = 3
126319 open("/u01/app/oracle/product/12.2.0/db/lib/libmql1.so", O_RDONLY) = 3
126319 open("/u01/app/oracle/product/12.2.0/db/lib/libipc1.so", O_RDONLY) = 3
126319 open("/u01/app/oracle/product/12.2.0/db/lib/libnnz12.so", O_RDONLY) = 3

[oracle@dbserver1 tmp]$ lsblk
NAME            MAJ:MIN    RM   SIZE RO TYPE MOUNTPOINT
sda               8:0       0 837.3G  0 disk
sda1              8:1       0     1G  0 part /boot
sda2              8:2       0   100G  0 part /
sda3              8:3       0    20G  0 part [SWAP]
sda4              8:4       0     1K  0 part
sda5              8:5       0    20G  0 part /opt
sda6              8:6       0    20G  0 part /usr
sda7              8:7       0    20G  0 part /var/log
sda8              8:8       0   200G  0 part /u01   <--- sda8 is the local disk where $ORACLE_HOME is mounted; check whether it has a problem
sdb               8:16      0   1.5T  0 disk
.....
sdo               8:224     0    30G  0 disk
sdp               8:240     0   300G  0 disk
sdq              65:0       0   300G  0 disk
sdr              65:16      0     2T  0 disk
asm!oradata-127 251:65025   0     2T  0 disk

zzz ***Wed Aug 14 10:00:11 CST 2019
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.35    0.00    2.83    3.21    0.00   82.61   <--- overall CPU usage is not high, nor is overall iowait
Device: rrqm/s wrqm/s    r/s   w/s    rkB/s   wkB/s avgrq-sz avgqu-sz  await  svctm  %util
sda     169.00   5.00 642.00  8.00 29000.00   52.00    89.39    10.19  35.69   2.54 100.00   <--- sda (/u01, the ORACLE_HOME filesystem) stays saturated, with clearly elevated await
sdb       0.00   0.00   0.00  0.00     0.00    0.00     0.00     0.00   0.00   0.00   0.00
sdc       0.00   0.00 223.00 24.00  5640.00  452.00    49.33     0.08   0.33   0.25   6.10

zzz ***Wed Aug 14 10:00:24 CST 2019
Device: rrqm/s wrqm/s    r/s   w/s    rkB/s   wkB/s avgrq-sz avgqu-sz  await  svctm  %util
sda     176.00   1.00 656.00 25.00 23552.00  104.00    57.73    15.94  43.65   3.47 100.00   <--- sda (/u01, the ORACLE_HOME filesystem): IO wait remains high
sdb       0.00   0.00   0.00  0.00     0.00    0.00     0.00     0.00   0.00   0.00   0.00
sdc       0.00   0.00  40.00 20.00   336.00   3
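The strace test above can be reproduced with something like the following; this is only a sketch, the credentials and output file are placeholders, and on newer systems the opens may appear as openat calls instead.

# Trace a throw-away sqlplus connection and keep only file opens under the DB home.
strace -f -o /tmp/stsqlplus.txt sqlplus scott/tiger@dbinst1 </dev/null
grep '12.2.0/db' /tmp/stsqlplus.txt | grep open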
Root Cause
1. Communication with the application team revealed that the app's online store launches flash-sale campaigns on the hour, causing a clear surge in short-lived connections.
2. The host administrators reported that a failed disk in the local-disk RAID group caused abnormal response times, so large numbers of short-lived connections hung while reading library files under ORACLE_HOME; once the listener's queue limit was reached, the listener restarted.
Improvement Measures
1. Work with the application team to replace short-lived connections with connection pooling.
2. Replace the local disks with SSD storage to improve read/write performance for ORACLE_HOME.
Note: this incident occurred on Oracle 12.2. In Oracle 12c and later, the database continuously writes out large amounts of junk trace files, which not only hurts IO performance but also eats into filesystem capacity; we recommend placing the diag directory on a separate location and installing ORACLE_HOME on SSD storage.
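Two concrete ways to act on this note are to relocate the ADR base away from ORACLE_HOME via the diagnostic_dest parameter and to purge old trace files regularly. A minimal sketch, assuming a dedicated /u02/oradiag filesystem and an spfile; the path, ADR homepath, and retention are illustrative:

# Move the diagnostic destination (alert log, trace files) to a dedicated filesystem.
sqlplus -S / as sysdba <<'EOF'
alter system set diagnostic_dest='/u02/oradiag' scope=both;
EOF

# Purge trace files older than 7 days (10080 minutes) from the instance's ADR home.
adrci exec="set homepath diag/rdbms/dbinst1/dbinst11; purge -age 10080 -type trace"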
Source: IT那活儿, https://www.zyxiao.com/p/115559