newspaint

Documenting Problems That Were Difficult To Find The Answer To

How To Diagnose High Sys CPU On Linux

So you run top and you see something like the following:


top - 11:25:35 up 104 days,  5:35,  3 users,  load average: 8.42, 5.38, 3.01
Tasks: 408 total,   1 running, 407 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.3%us, 31.5%sy,  0.0%ni, 63.1%id,  0.3%wa,  0.0%hi,  0.8%si,  0.0%st
Mem:  66004540k total, 25653868k used, 40350672k free,  1450608k buffers
Swap:  5242872k total,        0k used,  5242872k free, 22426464k cached

What Is Sys CPU Time?

What is this “sys” CPU time? It refers to time the CPU spent running in kernel mode – usually one of the functions listed in this list of syscalls.

Why Do We Care About High Sys CPU Time?

There may be the odd time when a high proportion of system CPU time is acceptable but this would be rare (perhaps for a router or other high I/O server). An application server, however, is supposed to do intelligent things and intelligence mostly comes from the user-level CPU time (either as “user” or “nice”). So when the proportion of CPU time is mostly spent in the kernel (system) it indicates something is wrong – perhaps the application is designed badly or inefficiently – or perhaps the system is having I/O troubles.

How To Diagnose The Cause

We want to find out why a particular process is causing abnormally high system CPU time. The top tool will illustrate which process is using most of the CPU – but then you may need to select an individual thread to dig into. To list the threads in a process on Linux use the following command:

ps -AL |grep process_name

The strace tool is invaluable for determining if a particular call is being made an excessive amount. The tool is useful for debugging as it records the system calls being made by a process. The count option (-c) gives some performance information about a process giving a view of which call is taking up how much of the system CPU time:


[root@host ~]# strace -c -p 6615
Process 6615 attached - interrupt to quit
Process 6615 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 69.16    0.764598           4    214801           getrusage
 25.20    0.278548         142      1959           poll
  3.67    0.040564          13      3101           write
  0.87    0.009578          16       597        29 futex
  0.84    0.009302           5      1959           read
  0.18    0.001999        1000         2           nanosleep
  0.05    0.000599          50        12           clock_gettime
  0.03    0.000284           6        50           sched_yield
  0.00    0.000000           0         1           close
  0.00    0.000000           0         1           shutdown
------ ----------- ----------- --------- --------- ----------------
100.00    1.105472                222483        29 total

In this example it appears that the getrusage() call is being made so frequently (100x more than any other system call) that it is taking up nearly 70% of the system CPU time of this process. Clearly this is unhealthy and if we look at the man page for this call it appears to be related to statistics/debugging running in this application.

Another technique is to dump the running stack of all the threads in the misbehaving process:


  for i in `ps -AL |grep process_name |cut -c 7-12`; do \
    echo === $i ===; \
    gdb --q --n --ex bt --batch --pid $i; \
  done 2>&1 |tee /tmp/stacks.txt

You may need to run that command line several times to see if any particular user function is the primary culprit.

3 responses to “How To Diagnose High Sys CPU On Linux

  1. allain 2015-12-16 at 14:57:14

    Much obliged. This helped me out greatly.

  2. Ravi 2016-09-27 at 11:06:57

    Thanks for the great post.

  3. Danny 2017-09-05 at 04:35:42

    Thanks for the great post.

Leave a comment