Friday, July 22, 2005

How to debug a very high workload

How to debug a very high workload ?

While the system is in the state to use >90% kernel cpu time, run the
lockstat command to produce a basic kernel profile, e.g. while a
"sleep 15" command is runing:

# lockstat -kIW -D 20 sleep 15

What functions are listed as top kernel cpu time consumers?

I have reproduced the problem very carefully. The load starts to
increase whenever I remove some very big files. For example; 5 files
which take up approx. 24Gb together. This happens on the disks which are
not part of the Raid1. Whenever this happens these are the load
consumers:

1882 83% 83% 0.00 2713647 cpu[0] logmap_cancel
203 9% 92% 0.00 4762898 cpu[0] fakesoftint_return
22 1% 93% 0.00 5909501 cpu[0] cpu_halt

I've already tried to look into logmap_cancel but unfortunatly SunSolve
seems to be down (I can't login "500 - internal server error") and
google doesn't help me much here either, so I hope this makes a little
more sense to you.

Lion-O <n...@m.catslair.org> writes:
>> # lockstat -kIW -D 20 sleep 15

>> What functions are listed as top kernel cpu time consumers?

> I have reproduced the problem very carefully. The load starts to
> increase whenever I remove some very big files. For example; 5 files
> which take up approx. 24Gb together. This happens on the disks which are
> not part of the Raid1. Whenever this happens these are the load
> consumers:

> 1882 83% 83% 0.00 2713647 cpu[0] logmap_cancel
> 203 9% 92% 0.00 4762898 cpu[0] fakesoftint_return
> 22 1% 93% 0.00 5909501 cpu[0] cpu_halt

Yes, I can reproduce it on S10 x86 GA (but not on S10 sparc GA), with
a standard UFS logging filesystem (40G total, 18G free space), when
deleting three 5GB files:

% mkfile 5G a b c
% sync
% lockfs -f .

System is idle, no disk activity at this point. Then:

% rm a b c
%

After "rm" has exited, kernel cpu time usage rises to > 90% and a
lockstat kernel profile shows that logmap_cancel() is responsible for
the kernel's cpu time usage. The kernel is in this state for ~ 90
seconds.

# lockstat -kIW -D 20 sleep 20

Profiling interrupt: 1972 events in 20.241 seconds (97 events/sec)

Count indv cuml rcnt nsec Hottest CPU+PIL Caller
-------------------------------------------------------------------------------
1679 85% 85% 0.00 246716 cpu[0] logmap_cancel
132 7% 92% 0.00 239730 cpu[0] cpu_halt
35 2% 94% 0.00 230008 cpu[0] (usermode)
18 1% 95% 0.00 325389 cpu[0] mutex_enter
12 1% 95% 0.00 241294 cpu[0] free
10 1% 96% 0.00 243065 cpu[0] logmap_cancel_delta
7 0% 96% 0.00 212141 cpu[0]+11 fakesoftint_return
6 0% 96% 0.00 260988 cpu[0] bzero
5 0% 97% 0.00 243710 cpu[0] rm_assize
4 0% 97% 0.00 343741 cpu[0] deltamap_add
4 0% 97% 0.00 115899 cpu[0] indirtrunc
4 0% 97% 0.00 136087 cpu[0] tsd_agent_get
4 0% 97% 0.00 191666 cpu[0] hash2ints
4 0% 98% 0.00 138739 cpu[0] bread_common
3 0% 98% 0.00 433313 cpu[0] logmap_free_cancel
3 0% 98% 0.00 326170 cpu[0] kcopy
2 0% 98% 0.00 110654 cpu[0] poll_common
2 0% 98% 0.00 187142 cpu[0] brelse
2 0% 98% 0.00 307884 cpu[0] bdwrite
2 0% 98% 0.00 61904 cpu[0]+11 lock_set_spl
-------------------------------------------------------------------------------

0 Comments:

Post a Comment

<< Home