数据库:Oracle 9.2.0.8
平台:HP-UX 11.11
在对客户的数据库进行健康检查时发现:
TOP 5等待事件中最高的是enqueue,其次是PX Deq: Execute Reply,这两个等待事件占了所有事件的65%以上。
等待事件“enqueue”所占比例最高,一共等待了14,487s的时间,但从后面的enqueue的统计信息来看,所统计的SQ enqueue和TX enqueue的等待时间总和只有49s,其他统计的enqueue等待时间基本为0s,所以还有其他未统计进statspack的enqueue存在,并占据了大量了等待时间。
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~ % Total
Event Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
enqueue 6,339 14,487 44.02
PX Deq: Execute Reply 2,773 7,222 21.94
CPU time 5,314 16.15
db file sequential read 982,418 3,574 10.86
row cache lock 80,167 691 2.10
-------------------------------------------------------------
Enqueue activity for DB: IBSS Instance: ibss2 Snaps: 311 -312
-> Enqueue stats gathered prior to 9i should not be compared with 9i data
-> ordered by Wait Time desc, Waits desc
Avg Wt Wait
Eq Requests Succ Gets Failed Gets Waits Time (ms) Time (s)
-- ------------ ------------ ----------- ----------- ------------- ------------
SQ 5,589 5,589 0 206 181.50 37
TX 189,789 189,398 391 163 76.10 12
HW 1,601 1,599 2 423 .55 0
US 1,949 1,949 0 339 .67 0
TM 912,315 912,254 56 155 .36 0
TT 638 638 0 48 .92 0
FB 39 39 0 26 .31 0
TD 16 16 0 16 .50 0
IR 10 10 0 8 .63 0
WL 8 6 2 6 .67 0
DR 4 4 0 4 .50 0
RO 3 3 0 3 .67 0
TA 26 26 0 2 .50 0
DL 123 123 0 1 1.00 0
TS 6 6 0 5 .00 0
CF 12,113 12,113 0 1 .00 0
-------------------------------------------------------------
采样查询等待事件为enqueue的session:
SQL> select sid,status,program,sql_hash_value from v$session where sid in (select sid from v$session_wait where event='enqueue');
SID STATUS PROGRAM SQL_HASH_VALUE
---------- -------- ------------------------- --------------
194 ACTIVE ORACLE.EXE 4166735688
311 ACTIVE oracle@IBSSDB2 (P000) 1611783128
在多个时间点采样发现,enqueue等待事件始终只有两个这session和这两个sql_hash_value。
查找sid=194的sql:
SQL> select sql_text from v$sql where hash_value=4166735688;
no rows selected
sid:194找不到相应的sql。
查找sid=311的sql:
SQL> select sql_text from v$sql where hash_value=1611783128;
SQL_TEXT
--------------------------------------------------------------------------------
SELECT /*+ Q2447000 */ A1."INST_ID",A1."FORMATID",A1."GLOBALID",A1."BRANCHID",A1
."REFCOUNT" FROM "SYS"."GV$GLOBAL_TRANSACTION" A1 WHERE A1."FORMATID"=:B1 AND (R
AWTOHEX(A1."GLOBALID")=:B2 OR A1."GLOBALID"=:B3) AND (RAWTOHEX(A1."BRANCHID")=:B
4 OR A1."BRANCHID"=:B5)
通过V$LOCK分析锁:
SESS ID1 ID2 LMODE REQUEST TY
---------------- ---------- ---------- ---------- ---------- --
Holder: 217 18 0 4 0 DX
Waiter: 311 18 0 0 4 DX
Waiter: 194 18 0 0 6 DX
在锁的等待中,持有者是217,等待者是311和192,等待的锁为DX锁。
DX锁:Distributed transaction entry
通过用oradebug分析,同样发现session 311被session 194阻塞,而session 194被session 217阻塞。
Open chains found:
Chain 1 : <cnode/sid/sess_srno/ospid/wait_event> :
<1/217/952/11188/SQL*Net more data to client>
-- <1/194/20578/13616/enqueue> -- <1/311/44354/24637/enqueue>
关于这三个session的信息:
SQL> select sid,USERNAME,STATUS,MACHINE,OSUSER,PROGRAM,MODULE from v$session where sid in (217,194,311);
SPID SID USERNAME STATUS MACHINE OSUSER PROGRAM MODULE
------ ----- ----------------- -------- ---------- ---------------- ---------------------- ------------
11188 194 IBSSCX ACTIVE DL380G4 Administrator ORACLE.EXE ORACLE.EXE
13616 217 IBSSCX ACTIVE DL380G4 Administrator ORACLE.EXE ORACLE.EXE
24637 311 ORA_RECO_070361 ACTIVE DL380G4 SYSTEM oracle@IBSSDB2 (P000)
SID
---------- ---------
217
194
311
oracle9i 11188 1 0 Jul 19 ? 0:52 oracleibss2 (LOCAL=NO)
oracle9i 13616 1 0 Jul 19 ? 8:55 oracleibss2 (LOCAL=NO)
oracle9i 24637 1 0 Aug 16 ? 6:42 ora_p000_ibss2
检查session 217:
SQL> select sid,status,program,sql_hash_value from v$session where sid=217;
SID STATUS PROGRAM SQL_HASH_VALUE
---------- -------- ------------------------- --------------
217 ACTIVE ORACLE.EXE 4198533449
SQL> select event from v$session_wait where sid=217;
EVENT
----------------------------------------------------------------
SQL*Net more data to client
SQL> select sql_text from v$sql where hash_value=4198533449;
SQL_TEXT
--------------------------------------------------------------------------------
SELECT "MSINFO_ID","DISC_ID" FROM "TB_CM_MSINFO" "AA"
SQL> select count(*) from IBSSCX.TB_CM_MSINFO;
COUNT(*)
----------
1780102
检查session 217后发现,该session的status始终为ACTIVE,等待事件为:SQL*Net more data to client
sql为一条简单的select语句,并且查询的表只有1780102行数据,怀疑该进程僵死。
相关的"PX Deq: Execute Reply"等待事件分析:
select * from
(select p.spid,s.sid,s.program,s.sql_hash_value,s.status,
to_char(s.LOGON_TIME,'yyyy-mm-dd HH24:MI:SS'),s.LAST_CALL_ET,
w.event
from v$session s,v$session_wait w,v$process p where s.type='USER'
and s.sid=w.sid and p.addr=s.paddr
order by s.LOGON_TIME)
where rownum<=10;
SPID SID PROGRAM SQL_HASH_VALUE STATUS TO_CHAR(S.LOGON_TIM LAST_CALL_ET EVENT
------ ----- --------------------------------- -------------- -------- ------------------- ------------ ------------------------------
19312 14 oracle@IBSSDB2 (CJQ0) 0 ACTIVE 2009-07-09 23:01:37 16121061 rdbms ipc message
26267 182 ORACLE.EXE 3477857362 INACTIVE 2009-07-19 14:30:33 15284881 SQL*Net message from client
11188 217 ORACLE.EXE 4198533449 ACTIVE 2009-07-19 15:17:55 15284427 SQL*Net more data to client
13616 194 ORACLE.EXE 4166735688 ACTIVE 2009-07-19 15:25:37 15284414 enqueue
24637 311 oracle@IBSSDB2 (P000) 1611783128 ACTIVE 2009-08-16 10:18:07 12883664 enqueue
24632 304 ORACLE.EXE 1963934439 ACTIVE 2009-08-16 10:18:07 12883665 PX Deq: Execute Reply
3261 68 JDBC Thin Client 3314230268 INACTIVE 2009-11-18 08:54:56 1125959 SQL*Net message from client
2848 40 JDBC Thin Client 588438348 INACTIVE 2009-11-20 11:35:25 4548 SQL*Net message from client
4147 335 SAPP2">rmrescheck@IBSSAPP2 (TNS V1-V3) 0 INACTIVE 2009-12-01 11:53:59 2491 SQL*Net message from client
4159 52 rmibssengine@IBSSAPP2 (TNS V1-V3) 3206692553 INACTIVE 2009-12-01 11:54:04 21 SQL*Net message from client
从session状态、运行时间及等待事件来看,存在问题的session有:
SPID SID
------------ ----------
11188 217
13616 194
24637 311
24632 304
已知 217、194、311在一条等待链上,而session 304的等待事件是"PX Deq: Execute Reply",在等待所有其他执行子进程完成,而session 311正是并行执行子进程,在整个等待事件统计中,只有1个"PX Deq: Execute Reply"等待事件,所以该等待事件也是由enqueue引发,根本原因也是session 217 hang住导致的。
建议检查session 217是否确实僵死,若僵死,建议将其kill