老熊的三分地-Oracle及数据恢复

五

shared pool latch和library cache latch

老熊 Oracle Internal研究 2011-05-03

我给《DBA 手记III》投的一篇稿子，名为《一次由隐含参数引起性能问题的处理》，这篇文章描述了一套关键的系统，由于"_kghdsidx_count"这个参数设置为1，导致了严重的性能问题。从故障现象上看是大量的library cache latch的等待，以及shared pool latch的等待，但是前者的等待时间比后者长得多。在文章中，我提到，在当时我推断，由于"_kghdsidx_count"这个隐含参数设置为1，导致shared pool只有1个subpool，引起了shared pool latch的严重竞争，进而引起了library cache cache的更为严重的竞争，这个竞争的过程如下：

由于"_kghdsidx_count"=1，使得shared pool latch只有1个child latch。而library cache latch的child latch数量跟CPU数量有关，最大值为67，编号为child #1-#67。
会话1持有shared pool latch。
会话2解析SQL语句，首先持有了library cache latch的child latch，假设这里为child #1，然后去请求shared pool latch。很显然，这时候被会话1持有，那么会话2就会等待shared pool latch。
会话3解析SQL语句，首先需要请求library cache latch，如果请求的library cache child latch刚好是#1，那么由于会话2持有了这个child latch，就会等待library cache latch。
因此，实际上会话1和会话2的shared pool latch的竞争引起了会话3的library cache latch的等待。如果并发数不是太高，那么shared pool latch的竞争看上去就会比library cache latch的竞争多一些。但是如果有几百个活动会话，这个时候，就会有大量的会话首先等待library cache latch，因为在解析SQL时是首先需要获取library cache latch再获取shared pool latch。由于大量的软解析，甚至不需要获取shared pool latch，同时一个大型的OLTP系统中，某几条相同的SQL并发执行的概率很高，这样会使很多会话同时请求同一library cache child latch；另外，在解析过程中，可能会多次请求library cache latch和shared pool latch，而前者请求和释放的次数会比后者多得多；这样大量的会话在获取library cache latch时处于等待状态，从现象上看就比shared pool latch的等待多得多。

而本文主要表达的是，怎么来验证在解析时，Oracle进程在持有了library cache latch的情况下去请求shared pool latch，而不是在请求shared pool时不需要持有library cache latch。

由于这个验证过程过于internal，所以没有在《DBA手记III》中描述出来。这里写出来，供有兴趣的朋友参考。

验证上面的这一点，有2个方法。下面以测试过程来详细描述。
测试环境：Oracle 10.2.0.5.1 for Linux X86.

方法一：使用oradebug。

1. 将数据库的"_kghdsidx_count"参数值设为1，并重启数据库，以保证只有一个shared pool child latch。
2. 使用sqlplus连接到数据库，假设这个会话是session-1，查询当前的SID:

SQL> select sid from v$mystat where rownum=1;

       SID
----------
       159

同时获得当前连接的spid为2415。
3. 新建一个连接到数据库，假设会话是session-2，查询shared pool latch的地址，并使用oradebug将这个地址对应的值置为1，以表示该latch已经被持有：


SQL> select addr,latch#,level#,child#,name,gets  from v$latch_children where name='shared pool';

ADDR         LATCH#     LEVEL#     CHILD# NAME                                                     GETS
-------- ---------- ---------- ---------- -------------------------------------------------- ----------
200999BC        216          7          1 shared pool                                             34949
20099A20        216          7          2 shared pool                                                 6
20099A84        216          7          3 shared pool                                                 6
20099AE8        216          7          4 shared pool                                                 6
20099B4C        216          7          5 shared pool                                                 6
20099BB0        216          7          6 shared pool                                                 6
20099C14        216          7          7 shared pool                                                 6

SQL> oradebug poke 0x200999BC 4 1
BEFORE: [200999BC, 200999C0) = 00000000
AFTER:  [200999BC, 200999C0) = 00000001

4. 在session-1会话中执行下面的SQL：

SQL> select owner from t where object_id=3003;

正如预料之中的反映，这个会话hang住。

5. 在session-2中，对session-1的进程作process dump。（注意这个时候不能查询v$session_wait、v$latchholder等视图）

SQL> oradebug setospid 2415
Oracle pid: 15, Unix process pid: 2415, image: oracle@xty (TNS V1-V3)
SQL> oradebug dump processstate 10
Statement processed.
SQL> oradebug tracefile_name
/oracle/app/oracle/admin/xty/udump/xty_ora_2415.trc

然后从/oracle/app/oracle/admin/xty/udump/xty_ora_2415.trc这个TRACE文件中可以找到下面的信息：


Process global information:
     process: 0x51a2dce8, call: 0x51b465e4, xact: (nil), curses: 0x51b2a4b8, usrses: 0x51b2a4b8
  ----------------------------------------
  SO: 0x51a2dce8, type: 2, owner: (nil), flag: INIT/-/-/0x00
  (process) Oracle pid=15, calls cur/top: 0x51b465e4/0x51b465e4, flag: (0) -
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 110 0 4
              last post received-location: kslpsr
              last process to post me: 51a2a904 1 6
              last post sent: 0 0 24
              last post sent-location: ksasnd
              last process posted by me: 51a2a904 1 6
    (latch info) wait_event=0 bits=20
        Location from where call was made: kghupr1: Chunk Header
        Context saved from call: 1306572176
      waiting for 200999bc Child shared pool level=7 child#=1 
        Location from where latch is held: kghupr1: Chunk Header
        Context saved from call: 1299065692
        state=busy, wlstate=free
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           10 (54, 1304393092, 54)
           11 (54, 1304393092, 54)
           24 (15, 1304393092, 15)
           15 (6, 1304393092, 6)
           waiter count=4
          gotten 35138 times wait, failed first 502 sleeps 5
          gotten 0 times nowait, failed: 0
      on wait list for 200999bc
      holding    (efd=3) 4f7c75f8 Child library cache level=5 child#=1
        Location from where latch is held: kgllkdl: child: no lock handle: latch
        Context saved from call: 2
        state=busy, wlstate=free

从waiting for 200999bc Child shared pool level=7 child#=1和holding (efd=3) 4f7c75f8 Child library cache level=5 child#=1可以看到，进程的确是在持有了library cache latch的同时去请求 shared pool latch。

最后使用“oradebug poke 0x200999BC 4 0”命令将shared pool latch置为free状态。

方法二：使用gdb。
1. 还原"_kghdsidx_count"参数，重启数据库。
2. 使用sqlplus连接到数据库，获取对应的Oracle Process的spid为4763。
3. 使用gdb：

[oracle@xty ~]$ gdb -p 4763
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5)
Copyright (C) 2009 Free Software Foundation, Inc.
...省略部分输出...
Reading symbols from /lib/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libnss_files.so.2
0x0059b402 in __kernel_vsyscall ()

4. 在gdb中，在latch的获取和释放的函数上设置断点：

(gdb) b kslgpl
Breakpoint 1 at 0x8339fca
(gdb) b kslgetsl
Breakpoint 2 at 0x833a2d5
(gdb) b kslgpl
Breakpoint 3 at 0x83473bf
(gdb) b kslg2c
Breakpoint 4 at 0x8347886
(gdb) b kslg2csl
Breakpoint 5 at 0x834794e
(gdb) b kslgprl
Breakpoint 6 at 0x8346186
(gdb) 
Note: breakpoint 6 also set at pc 0x8346186.
Breakpoint 7 at 0x8346186
(gdb) b kslfre
Breakpoint 8 at 0x833c05d
(gdb) b kslf2c
Breakpoint 9 at 0x8347a1e
(gdb) b kslfpl
Breakpoint 10 at 0x8347812
(gdb) b kslfal
Breakpoint 11 at 0x833fbca
(gdb) b kslfpl0
Breakpoint 12 at 0x834766d
(gdb) 
注：最后几个函数只是猜测跟latch有关....latch的请求和释放所调用的函数主要为：kslgetl、kslgetsl和kslfre。
(gdb) c
Continuing.

5. 在sqlplus中执行SQL：

SQL> select * from t where object_id=10000;

正如预期的那样，会话hang住。这是因为运行到了gdb设置的断点。

6.在gdb中可以看到：

Breakpoint 2, 0x0833a2d5 in kslgetsl ()
(gdb) info f
Stack level 0, frame at 0xbfd3fb08:
 eip = 0x833a2d5 in kslgetsl; saved eip 0x84257e1
 called by frame at 0xbfd3fb44
 Arglist at 0xbfd3fb00, args: 
 Locals at 0xbfd3fb00, Previous frame's sp is 0xbfd3fb08
 Saved registers:
  ebp at 0xbfd3fb00, eip at 0xbfd3fb04
(gdb) x /10x 0xbfd3fb00
0xbfd3fb00:     0xbfd3fb3c      0x084257e1      0x51b4c38c      0x00000001
0xbfd3fb10:     0x51b11738      0x00000058      0x00000008      0x00000002
0xbfd3fb20:     0x0d099c40      0x0d099d3c

0x51b4c38c这是一个地址，从v$latch和v$latch_children中可以查到正是是 session idle bit latch。

6.在gdb中继续：

(gdb) c    
Continuing.

Breakpoint 8, 0x0833c05d in kslfre ()
(gdb) info f
Stack level 0, frame at 0xbfd3fb18:
 eip = 0x833c05d in kslfre; saved eip 0x842581a
 called by frame at 0xbfd3fb44
 Arglist at 0xbfd3fb10, args: 
 Locals at 0xbfd3fb10, Previous frame's sp is 0xbfd3fb18
 Saved registers:
  ebp at 0xbfd3fb10, eip at 0xbfd3fb14
(gdb) x /10x 0xbfd3fb10
0xbfd3fb10:     0xbfd3fb3c      0x0842581a      0x51b4c38c      0x00000002
0xbfd3fb20:     0x0d099c40      0x0d099d3c      0x0d099c40      0x00000002
0xbfd3fb30:     0x00000000      0x517a7090
(gdb) c
Continuing.
----下面很重要-----
Breakpoint 1, 0x08339fca in kslgetl ()
(gdb) x /10x $ebp+8
0xbfd3ecc0:     0x4f7c7594      0x00000001      0x00000001      0x00000a57
0xbfd3ecd0:     0x0af275ca      0x00000000      0x0dcfdcc0      0x0d045860
0xbfd3ece0:     0xbfd3ed58      0x0b20fc09
(gdb) c
Continuing.

Breakpoint 8, 0x0833c05d in kslfre ()
(gdb) x /10x $ebp+8
0xbfd3ed64:     0x4f7c7594      0x00000000      0x10010040      0x0d045860
0xbfd3ed74:     0xbfd3ef10      0x0b1f5d10      0x0d099c40      0x4f7c7594
0xbfd3ed84:     0x00000000      0x00000008
(gdb) c
Continuing.

Breakpoint 1, 0x08339fca in kslgetl ()
(gdb) x /10x $ebp+8
0xbfd3ecc0:     0x4f7c7594      0x00000001      0x00000001      0x00000a57
0xbfd3ecd0:     0x0af275ca      0x00000000      0x0dcfdcc0      0x0d045860
0xbfd3ece0:     0xbfd3ed58      0x0b20fc09
(gdb) c
Continuing.

Breakpoint 1, 0x08339fca in kslgetl ()
(gdb)  x /10x $ebp+8
0xbfd3eb50:     0x200999bc      0x00000001      0x00000000      0x000009de
0xbfd3eb60:     0x0d099c40      0x0000000f      0x0c3c3ec0      0x0d045860
0xbfd3eb70:     0xbfd3ec10      0x0af3983b
(gdb) c
Continuing.

Breakpoint 8, 0x0833c05d in kslfre ()
(gdb)  x /10x $ebp+8
0xbfd3eb6c:     0x200999bc      0x0d099c40      0x2001d310      0x0d045860
0xbfd3eb7c:     0xbfd3ec10      0x0af390a8      0x0d099c40      0x200999bc
0xbfd3eb8c:     0xbfd3ebe8      0x00000164
(gdb) c
Continuing.

Breakpoint 8, 0x0833c05d in kslfre ()
(gdb)  x /10x $ebp+8
0xbfd3ed64:     0x4f7c7594      0x00000000      0x10010040      0x0d045860
0xbfd3ed74:     0xbfd3ef10      0x0b1f58e6      0x0d099c40      0x4f7c7594
0xbfd3ed84:     0x00000000      0x00000008
(gdb) quit
A debugging session is active.

        Inferior 1 [process 4763] will be detached.

同时可以从数据库中查到：

SQL> select addr,latch#,child#,level#,name from v$latch_children 
where name in ('library cache','shared pool');
ADDR         LATCH#     CHILD#     LEVEL# NAME
-------- ---------- ---------- ---------- --------------------------------------------------
4F7C75F8        217          1          5 library cache
4F7C7594        217          2          5 library cache
4F7C7530        217          3          5 library cache
200999BC        216          1          7 shared pool
20099A20        216          2          7 shared pool
20099A84        216          3          7 shared pool
20099AE8        216          4          7 shared pool
20099B4C        216          5          7 shared pool
20099BB0        216          6          7 shared pool
20099C14        216          7          7 shared pool

标记为”下面很重要“的输出，汇总起来就是：

kslgetl  0x4f7c7594  library cache
kslfre   0x4f7c7594  library cache
kslgetl  0x4f7c7594  library cache
kslgetl  0x200999bc  shared pool
kslfre   0x200999bc  shared pool
kslfre   0x4f7c7594  library cache

这个过程可以理解为：Oracle进程首先获取library cache latch，在library cache中查找是否已经有要执行的SQL语句（这里显然没有找到），然后释放library cache。接下来是硬解析SQL的后半部分，首先获取library cache latch，然后获取shared pool latch，从shared pool中申请内存，然后释放shared pool latch，最后再获取library cache latch。

后面不再继续跟踪解析的过程，这个过程会有相当多的latch的获取和释放，但是从上面可以很清晰地看到，进程在持有了library cache latch的同时去请求shared pool latch。

至于这个案例，大家会在《DBA 手记III》上看到完整的介绍。

latch, performance

Address: https://www.laoxiong.net/shared-pool-latch-and-library-cache-latch.html

« ORA-04031案例一则

存储过程与DB LINK的依赖关系存在的问题 »

Trackback

13 comments untill now

Sidney @ 2011-05-03 23:54

hi 老熊，谢谢分享，获益匪浅。
指出一出笔误，在第一种方法的最后，

进程的确是在持有了shared pool latch的同时去请求library cache latch
—->
进程的确是在持有了library cache latch的同时去请求shared pool latch

[回复]
老熊回复:
5月 4th, 2011 at 12:16 上午
谢谢指出文章中的错误。

[回复]
东东 @ 2011-05-05 13:18

不错，前面看老白的案例，觉得就有些怪异，照理说参数查个20绝对不可能出现类似问题的，看你的介绍，才比较明了，不过我还是有点奇怪的是老白说的是他去修改了那个参数，问题就解决了，你应该是在前面做过的，为什么没有完全生效，或者就是你们一起做，然后系统好了，呵呵，不知道是哪个效果了。

[回复]
老熊回复:
5月 5th, 2011 at 3:10 下午
@东东, 我不知道你从哪里看到的。
http://www.oraclefans.cn/forum/showtopic.jsp?rootid=32821
这里有我跟他的完整讨论。

[回复]
东东 @ 2011-05-05 16:18

看到完整讨论了，非常精彩，根据前面的结果，怎么看都是这个隐藏参数惹的祸，这个修改了的话应该就解决问题了，就是不知道怎么客户又去找老白改其他参数，难道你处理完以后，客户在监控的时候还是有问题?

[回复]
老熊回复:
5月 5th, 2011 at 4:42 下午
@东东, 是客户同时找到我们，然后最后实施的时候，综合了我们的方案。

[回复]
Meets @ 2011-05-18 15:44

那段黑体字表达的太cool 了，下文对这个猜测的证明，简明又有说服力，精彩！

[回复]
cc @ 2011-05-21 21:11

有个地方不明白：

“_kghdsidx_count”=1，使得shared pool latch只有1个child latch。但是并没有使得library cache latch为1.

>>会话3解析SQL语句，首先需要请求library cache latch，如果请求的library cache child latch刚好是#1，那么由于会话2持有了这个child latch，就会等待library cache latch。

会话3解析SQL语句时候，获取library cache latch, 发现#1被持有，但是Oracle应该自己去获取#2呀，为什么非要在#1去获取呢？就像其他child latch一样呀。

还有如果是单CPU系统，不知道library cache latch的值是多少？我觉得不可能是1，如果是的话，岂不是稍微多一点并发，library cache latch争用岂不是很严重？

[回复]
老熊回复:
5月 21st, 2011 at 9:52 下午
单CPU的library cache latch还真没注意过。不过现在主流系统都在使用10g以上的版本了，cursor的管理采用mutex，理论上讲比library cache latch更能避免竞争。

[回复]
老熊的三分地-Oracle、UNIX、数据恢复 » Blog Archive » 修改隐含参数_library_cache_advice解决性能问题一例 @ 2011-10-03 01:10

[…] cache latch的争用（请参考我的另一篇文章《shared pool latch和library cache latch》），所以我们这里主要考虑是什么引起了shared pool latch的争用。另外，cursor: […]
mikewang @ 2011-10-24 13:18

Hi,老熊
这个案例，我看了你这里的描述，老白那里的描述，还有DBA手记III中的描述。
我认为，一些调整措施中，删除了某些表的直方图统计也是有效果的。
在cursor_sharing=smiliar情况下，有直方图的SQL执行会有类似hard parse出现，也就是需要shared pool latch.
在调整前的hard parse值为每秒19次，调整后，为10次。但parse数是从每秒2799次上升到8367次。
调整前，两套库的cpu_count是不同的，后来虽然调整为一致。但按照你说的，CPU_COUNT=80时，也有问题。
_kghdsidx_count，这个参数调整前，shared pool latch只有一个，要处理每秒19次hard parse.
你调整后，shared pool latch 会提升，具体值我不清楚。但处理的hard parse反而降低了。
因此，性能恢复正常。

至于SHARED POOL SIZE调整，那是为以后的性能延续性考虑的。

以上个人见解，请指教。

Mike Wang

[回复]
老熊回复:
10月 24th, 2011 at 4:35 下午
在cursor_sharing=similar时，表上有直方图，的确可能产生child cursor，也就是额外的hard parse。
hard parse数量的降低，需要仔细评估，因为这涉及到到底是因为业务量不足引起的，还是由于系统性能引起的。

[回复]
» 硬解析和软解析情况下latch持有情况分析 Focus on Oracle @ 2012-05-05 13:21

[…] shared pool latch和library cache latch […]

Oracle及UNIX技术、Oracle数据恢复工具、观点

shared pool latch和library cache latch

13 comments untill now

Add your comment now

文章分类

所有文章

常用连接

博客连接

相关文章