środa, 11 maja 2011

Short story about RULE optimizer in Oracle 9i

OS: AIX 5300-05
Oracle: 9.2.0.8

My colleague received an incident that RMAN backup level 0 didn`t run for almost 2 weeks for production database and decided to run manually.

The database has about 300GB so it is not a big deal run the backup during the day.

I don`t understand why people thing that backup has to be done only in the night. But never mind.

He executed the backup script and looked into a log.


Tue May 10 15:40:08 CEST 2011
-- Day : 2 - Field : 6 --
-- CONNECT AS : target / nocatalog
-- TDP OPT FILE: /XXX/app/oracle/etc/tdp/XXX.opt
-- TAPE: 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/XXX/app/oracle/etc/tdp/XXX.opt)’

And nothing more. We were waiting and waiting but there was no progress.

It seems that a backup script was stuck somewhere.
I checked the sessions on the database and there was not 'rman'  at all.
The background process which was executed was running but there was no progress at all.

Even a truss command didn`t show the reason

3674300: 3907729: __libc_sbrk(0x0000000000110020)        = 0x00000000103D4DE0
6922378: 10072253: kread(3, "1001 Ó p / ñ ä\0  \0 f `".., 4096) (sleeping...)
8634422: 7528505: kread(0, "\0\0\0\0\0\0\0\0\0\0\0\0".., 4096) (sleeping...)
8319216: 11698395: kread(10, "02 æ\0\006\0\0\0\0\0101D".., 2064) (sleeping...)


OK. So first of all let`s check using topas what is going on at the system.




oraXXX[XXX](/home/oraXXX)$ ps auxw | grep 7073904 | grep -v grep
oraXXX  7073904  6.7  0.0 51524 33220      - A    15:40:14  2:15 oracleXXX (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))


LOCAL=YES, so it was run from a server.

sys@XXX> select vs.username, vs.program, vs.status,
to_char(vs.logon_time,'YYYY.MM.DD HH24:MI:SS') LOGON_TIME,
vs.sid, vs.serial#, vp.spid bg_pid
from v$session vs, v$process vp
where vs.paddr(+) = vp.addr and vp.spid=7073904;

USERNAME   PROGRAM          STATUS   LOGON_TIME          SID  SERIAL# BG_PID
---------- ---------------- -------- ------------------- ---- ---------------
SYS        sqlplus@XXXXXXXX ACTIVE   2011.05.10 15:40:14 1558 14384   7073904
           XXXXXXX (TNS V1-
           V3)


Let`s have a look what this session is doing.

sys@XXX> SELECT a.SQL_TEXT FROM v$sqlarea a, v$session b
             WHERE a.ADDRESS=b.SQL_ADDRESS AND b.serial#='14384';

SQL_TEXT
------------- ----------------------------------------------------------------select decode(count(s.sid),0,'ENQUEUEFREE') from    v$session     s,    v$process     p,    v$transaction t,    v$rollstat    r,    v$rollname    n where s.paddr = p.addr and   s.taddr = t.addr (+) and   t.xidusn = r.usn (+) and   r.usn = n.usn (+) and s.username is not null and s.sid = (select s.sid from v$session s, v$enqueue_lock l where l.sid = s.sid and l.type = 'CF' and l.id1 = 0 and l.id2 =
order by 1


Hmm, is this SQL is a part of the backup script ?


view /XXX/local/oracle/bin/rman.sh
...
check_enqueue()
{
sqlplus -s "/ as sysdba" <<EOF | grep [A-Z]
set heading off
set feedback off
set timing off

select
decode(count(s.sid),0,'ENQUEUEFREE')
from
v\$session     s,
   v\$process     p,
   v\$transaction t,
   v\$rollstat    r,
   v\$rollname    n
where s.paddr = p.addr
and   s.taddr = t.addr (+)
and   t.xidusn = r.usn (+)
and   r.usn = n.usn (+)
and s.username is not null
and s.sid = (select s.sid
from v\$session s, v\$enqueue_lock l
where l.sid = s.sid and l.type = 'CF' and l.id1 = 0 and l.id2 = 2)
order by 1
;
EOF
}


waitEnqueue()
{
   # checking if the controlfile enqueue is available
   if [ "`check_enqueue`" != "ENQUEUEFREE"  ]
   then
        echo "Sleeping 60 seconds: controlfile enqueue not available" | tee -a ${LOGFILE}
        sleep 60
        waitEnqueue
   else
        doBackup
   fi
}


Bingo. Let`s see what the session is waiting for ?

No idle waiting events for this session in the  v$session_wait. Only topas shows that a session is consuming a CPU.

Let`s trace the sql.

SQL> alter session set timed_statistics = true;
SQL> alter session set statistics_level=ALL;
SQL> alter session set max_dump_file_size=unlimited;
SQL> alter session set tracefile_identifier='without_rule'
SQL> alter session set events = '10046 trace name context forever,              level 12 : 10053 trace name context forever, level 1';
SQL> select
decode(count(s.sid),0,'ENQUEUEFREE')
from
   v$session     s,
   v$process     p,
   v$transaction t,
   v$rollstat    r,
   v$rollname    n
where s.paddr = p.addr
and   s.taddr = t.addr (+)
and   t.xidusn = r.usn (+)
and   r.usn = n.usn (+)
and s.username is not null
and s.sid = (select s.sid
from v$session s, v$enqueue_lock l
where l.sid = s.sid and l.type = 'CF' and l.id1 = 0 and l.id2 = 2)
order by 1;
SQL> select * from dual;
SQL> alter session set events = '10046 trace name context off : 10053 trace name context off ';
SQL> exit;

After using tkprof I have

select
decode(count(s.sid),0,'ENQUEUEFREE')
from
   v$session     s,
   v$process     p,
   v$transaction t,
   v$rollstat    r,
   v$rollname    n
where s.paddr = p.addr
and   s.taddr = t.addr (+)
and   t.xidusn = r.usn (+)
and   r.usn = n.usn (+)
and s.username is not null
and s.sid = (select s.sid
from v$session s, v$enqueue_lock l
where l.sid = s.sid and l.type = 'CF' and l.id1 = 0 and l.id2 = 2)
order by 1

call     count       cpu    elapsed  disk    query    current     rows
------- ------  -------- ---------- ------  -------  ---------  -------
Parse        1      0.05       0.06   0        4        0           0
Execute      1      0.00       0.00   0        0        0           0
Fetch        2    599.16     785.91   0        0        0           1
------- ------  -------- ---------- ------  -------  ---------  -------
total        4    599.21     785.98   0        4        0           1

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: SYS

Rows     Row Source Operation
-------  ---------------------------------------------------
      1  SORT AGGREGATE (cr=0 r=0 w=0 time=785917722 us)
      0   HASH JOIN OUTER (cr=0 r=0 w=0 time=785917712 us)
      0    NESTED LOOPS  (cr=0 r=0 w=0 time=785917251 us)
      0     HASH JOIN OUTER (cr=0 r=0 w=0 time=785917249 us)
      0      HASH JOIN OUTER (cr=0 r=0 w=0 time=785916653 us)
      0       FIXED TABLE FULL X$KSUSE (cr=0 r=0 w=0 time=785915416 us)
      0       FIXED TABLE FULL X$KTCXB
      0      FIXED TABLE FULL X$KTURD
      0     FIXED TABLE FIXED INDEX X$KSUPR (ind:1)
      0    VIEW
      0     NESTED LOOPS
      0      FIXED TABLE FULL X$KTURD
      0      INDEX UNIQUE SCAN I_UNDO1 (object id 34)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2      328.41        328.41

To remind you how to interpret TKPROF output (follow the Metalink note: How To Use SQL Trace And TKPROF For Performance Issues [ID 980711.1])


PARSE: Translates the SQL statement into an execution plan, including checks for proper security authorization and checks for the existence of tables, columns, and other referenced objects.
EXECUTE: Actual execution of the statement by Oracle. For INSERT, UPDATE, and DELETE statements, this modifies the data. For SELECT statements, this identifies the selected rows. 

FETCH: Retrieves rows returned by a query. Fetches are only performed for SELECT statements.


SQL Trace Statistic


COUNT: Number of times a statement was parsed, executed, or fetched.
CPU: Total CPU time in seconds for all parse, execute, or fetch calls for the statement. This value is zero (0) if TIMED_STATISTICS is not turned on.
ELAPSED: Total elapsed time in seconds for all parse, execute, or fetch calls for the statement. This value is zero (0) if TIMED_STATISTICS is not turned on.
DISK: Total number of data blocks physically read from the datafiles on disk for all parse, execute, or fetch calls.
QUERY: Total number of buffers retrieved in consistent mode for all parse, execute, or fetch calls.     Usually, buffers are retrieved in consistent mode for queries.
CURRENT: Total number of buffers retrieved in current mode. Buffers are retrieved in current mode for statements such as INSERT, UPDATE, and DELETE.

Let`s have a look at tkprof output. We have
FETCH/CPU 599.21 and FETCH/ELAPSED 785.98. This select is very costly.

I try to force using a RULE optimizer in this select.

SQL> select /*+ RULE */
decode(count(s.sid),0,'ENQUEUEFREE')
from
   v$session     s,
   v$process     p,
   v$transaction t,
   v$rollstat    r,
   v$rollname    n
where s.paddr = p.addr
and   s.taddr = t.addr (+)
and   t.xidusn = r.usn (+)
and   r.usn = n.usn (+)
and s.username is not null
and s.sid = (select s.sid
from v$session s, v$enqueue_lock l
where l.sid = s.sid and l.type = 'CF' and l.id1 = 0 and l.id2 = 2)
order by 1;


Let`s see an output from TKPROF.

select /*+ RULE */
decode(count(s.sid),0,'ENQUEUEFREE')
from
   v$session     s,
   v$process     p,
   v$transaction t,
   v$rollstat    r,
   v$rollname    n
where s.paddr = p.addr
and   s.taddr = t.addr (+)
and   t.xidusn = r.usn (+)
and   r.usn = n.usn (+)
and s.username is not null
and s.sid = (select s.sid
from v$session s, v$enqueue_lock l
where l.sid = s.sid and l.type = 'CF' and l.id1 = 0 and l.id2 = 2)
order by 1

call     count       cpu    elapsed   disk   query   current    rows
------- ------  -------- ---------- ------- ------- ---------  ------
Parse        1      0.01       0.01    0      2          0        0
Execute      1      0.00       0.00    0      0          0        0
Fetch        2      0.30       0.38    0      2       1258        1
------- ------  -------- ---------- ------- ------- ---------  ------
total        4      0.31       0.39    0      4       1258        1      

Misses in library cache during parse: 1
Optimizer goal: RULE
Parsing user id: SYS

Rows     Row Source Operation
-------  ---------------------------------------------------
      1  SORT AGGREGATE (cr=2 r=0 w=0 time=381085 us)
      0   FILTER  (cr=2 r=0 w=0 time=381075 us)
   1440    MERGE JOIN OUTER (cr=2 r=0 w=0 time=169179 us)
   1440     SORT JOIN (cr=0 r=0 w=0 time=165098 us)
   1440      MERGE JOIN OUTER (cr=0 r=0 w=0 time=158664 us)
   1440       SORT JOIN (cr=0 r=0 w=0 time=155327 us)
   1440        MERGE JOIN OUTER (cr=0 r=0 w=0 time=148218 us)
   1440         SORT JOIN (cr=0 r=0 w=0 time=64458 us)
   1440          MERGE JOIN  (cr=0 r=0 w=0 time=56813 us)
   1446           FIXED TABLE FULL X$KSUPR (cr=0 r=0 w=0 time=13101 us)
   1440           SORT JOIN (cr=0 r=0 w=0 time=41178 us)
   1440            FIXED TABLE FULL X$KSUSE (cr=0 r=0 w=0 time=32526 us)
     40         SORT JOIN (cr=0 r=0 w=0 time=81663 us)
     40          FIXED TABLE FULL X$KTCXB (cr=0 r=0 w=0 time=80434 us)
     40       SORT JOIN (cr=0 r=0 w=0 time=1763 us)
     72        FIXED TABLE FULL X$KTURD (cr=0 r=0 w=0 time=1005 us)
     40     SORT JOIN (cr=2 r=0 w=0 time=1670 us)
     72      VIEW  (cr=2 r=0 w=0 time=958 us)
     72       NESTED LOOPS  (cr=2 r=0 w=0 time=924 us)
     72        FIXED TABLE FULL X$KTURD (cr=0 r=0 w=0 time=649 us)
     72        INDEX UNIQUE SCAN I_UNDO1 (cr=2 r=0 w=0 time=143 us)(object id 34)
      0    MERGE JOIN  (cr=0 r=0 w=0 time=211198 us)
      0     SORT JOIN (cr=0 r=0 w=0 time=211196 us)
      0      MERGE JOIN  (cr=0 r=0 w=0 time=211183 us)
      0       SORT JOIN (cr=0 r=0 w=0 time=211182 us)
      0        MERGE JOIN  (cr=0 r=0 w=0 time=211168 us)
      3         SORT JOIN (cr=0 r=0 w=0 time=64881 us)
      3          FIXED TABLE FULL X$KSQRS (cr=0 r=0 w=0 time=64826 us)
      0         SORT JOIN (cr=0 r=0 w=0 time=146249 us)
    663          FIXED TABLE FULL X$KSQEQ (cr=0 r=0 w=0 time=145238 us)
      0       SORT JOIN
      0        FIXED TABLE FULL X$KSUSE
      0     SORT JOIN
      0      FIXED TABLE FULL X$KSUSE


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2        8.38          8.38


What a difference ! Now I have
FETCH/CPU 0.31 and FETCH/ELAPSED 0.39 so it is almost 2000 times better than before.

Brak komentarzy:

Prześlij komentarz