So you run top and you see something like the following:
top - 11:25:35 up 104 days, 5:35, 3 users, load average: 8.42, 5.38, 3.01
Tasks: 408 total, 1 running, 407 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.3%us, 31.5%sy, 0.0%ni, 63.1%id, 0.3%wa, 0.0%hi, 0.8%si, 0.0%st
Mem: 66004540k total, 25653868k used, 40350672k free, 1450608k buffers
Swap: 5242872k total, 0k used, 5242872k free, 22426464k cached
What Is Sys CPU Time?
What is this “sys” CPU time? It refers to time the CPU spent running in kernel mode – usually one of the functions listed in this list of syscalls.
Why Do We Care About High Sys CPU Time?
There may be the odd time when a high proportion of system CPU time is acceptable but this would be rare (perhaps for a router or other high I/O server). An application server, however, is supposed to do intelligent things and intelligence mostly comes from the user-level CPU time (either as “user” or “nice”). So when the proportion of CPU time is mostly spent in the kernel (system) it indicates something is wrong – perhaps the application is designed badly or inefficiently – or perhaps the system is having I/O troubles.
How To Diagnose The Cause
We want to find out why a particular process is causing abnormally high system CPU time. The top tool will illustrate which process is using most of the CPU – but then you may need to select an individual thread to dig into. To list the threads in a process on Linux use the following command:
ps -AL |grep process_name
The strace tool is invaluable for determining if a particular call is being made an excessive amount. The tool is useful for debugging as it records the system calls being made by a process. The count option (-c) gives some performance information about a process giving a view of which call is taking up how much of the system CPU time:
[root@host ~]# strace -c -p 6615
Process 6615 attached - interrupt to quit
Process 6615 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
69.16 0.764598 4 214801 getrusage
25.20 0.278548 142 1959 poll
3.67 0.040564 13 3101 write
0.87 0.009578 16 597 29 futex
0.84 0.009302 5 1959 read
0.18 0.001999 1000 2 nanosleep
0.05 0.000599 50 12 clock_gettime
0.03 0.000284 6 50 sched_yield
0.00 0.000000 0 1 close
0.00 0.000000 0 1 shutdown
------ ----------- ----------- --------- --------- ----------------
100.00 1.105472 222483 29 total
In this example it appears that the getrusage() call is being made so frequently (100x more than any other system call) that it is taking up nearly 70% of the system CPU time of this process. Clearly this is unhealthy and if we look at the man page for this call it appears to be related to statistics/debugging running in this application.
Another technique is to dump the running stack of all the threads in the misbehaving process:
for i in `ps -AL |grep process_name |cut -c 7-12`; do \
echo === $i ===; \
gdb --q --n --ex bt --batch --pid $i; \
done 2>&1 |tee /tmp/stacks.txt
You may need to run that command line several times to see if any particular user function is the primary culprit.
Much obliged. This helped me out greatly.
Thanks for the great post.
Thanks for the great post.