newspaint

Documenting Problems That Were Difficult To Find The Answer To

Monthly Archives: July 2013

The Software Quality Chain

In any project there are three places software quality issues can be identified.

  • during development
  • during formalised testing
  • with the end-users

Ultimately you hope your end-users never encounter bugs – and if they do encounter bugs you hope they are minor and that you can do something about it.

Every stage of the process reduces the risk of releasing software bugs (if done well); however the risk is never eliminated.

Development

Developers have the opportunity to protect against a wide range of software bugs by coding defensively.

More importantly, developers can choose whether to get to the bottom of problems reported further down the chain (by formalised testers or end-users) and make lasting changes to prevent recurrences.

Creating a basic set of automated unit tests allows developers a degree of confidence that any problems encountered in the past will likely be flushed out before being passed on to formalised testing.

Formalised Testing

Formalised testing has one and only one task: find bugs before the end-users do.

If testers have to work with poorly disciplined developers then their task suddenly becomes exponentially more difficult – because they are not only searching for small defects, they are likely to be inundated with large and mundane defects as well.

Worse, those testers may end up shell-shocked as the same bugs are re-introduced over and over again by a lazy development team uninterested in properly solving problems once and for all.

End Users

End users, when encountering bugs, can be offered the chance to report them before other end-users hit the same problems – giving a project the opportunity to fix them and keep the majority of the customer base happy.

Can You Help Me To Get A National Rail Feed?

I want to write a very lightweight website for mobile users that allows them to enter a departure and arrival station pair on a line and get the best possible information about possible trips between those stations.

National Rail already provide a basic website showing trains running between two stations. But what I really want is:

  • fastest trips only (strip out any trip that departs before another train that arrives earlier)
  • text-mode only, no graphics (3G mobile can be unforgiving, this must be lightweight)
  • no JavaScript, information is to be as simple and basic as possible
  • shows departure platform (essential at busy terminals to know where to walk)
  • shows arrival platform (useful because arriving at some platforms is more convenient than others)
  • shows current estimated departure and arrival times
  • shows trip duration
  • shows how many stops on that trip
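The first rule above amounts to a dominance filter: drop a trip if some other trip departs no earlier yet arrives no later. A minimal hypothetical sketch in Python (illustrative only; times are minutes past midnight):

```python
def fastest_trips(trips):
    """Filter a list of (depart, arrive) tuples, dropping any trip
    dominated by another - one departing no earlier yet arriving
    no later."""
    keep = []
    for a in trips:
        dominated = any(
            b != a and b[0] >= a[0] and b[1] <= a[1]
            for b in trips
        )
        if not dominated:
            keep.append(a)
    return sorted(keep)

# a 09:00 train arriving at 10:00 is pointless if a 09:10
# train arrives at 09:55
print(fastest_trips([(540, 600), (550, 595), (555, 620)]))
# → [(550, 595), (555, 620)]
```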

I already have a data-mining script that gets me this information for my personal use, and I rely on this every morning and every afternoon for my daily commute. But I shouldn’t have to use nefarious techniques to obtain this information and I’m sure it would be of enormous benefit to provide it to others.

The commuting public are treated abysmally by Network Rail in terms of physical conditions on peak-hour trains – with a significant proportion having to regularly stand for long periods (in excess of 20 minutes at a time).

Let’s respect them enough to keep them informed! I’m willing to help out but I need access to National Rail feeds – and if you know how then drop me a comment (all comments are moderated so I can keep them hidden if you so wish).

Update 2017-09-16: National Rail do actually have a free data feed API now. The wiki is available at http://nrodwiki.rockshore.net/ and the feeds are available from https://datafeeds.nationalrail.co.uk/.

How To Diagnose High Sys CPU On Linux

So you run top and you see something like the following:


top - 11:25:35 up 104 days,  5:35,  3 users,  load average: 8.42, 5.38, 3.01
Tasks: 408 total,   1 running, 407 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.3%us, 31.5%sy,  0.0%ni, 63.1%id,  0.3%wa,  0.0%hi,  0.8%si,  0.0%st
Mem:  66004540k total, 25653868k used, 40350672k free,  1450608k buffers
Swap:  5242872k total,        0k used,  5242872k free, 22426464k cached

What Is Sys CPU Time?

What is this “sys” CPU time? It refers to time the CPU spent running in kernel mode – usually one of the functions listed in this list of syscalls.
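To see the distinction in miniature, here is a hypothetical Python sketch (not from the original post): pure computation accumulates user time, while a loop of system calls accumulates sys time, both visible via os.times():

```python
import os

def burn_user_time(n):
    # pure computation: accumulates "user" CPU time
    return sum(i * i for i in range(n))

def burn_sys_time(n):
    # repeated syscalls (stat): accumulates "sys" (kernel) CPU time
    for _ in range(n):
        os.stat(".")

before = os.times()
burn_user_time(500_000)
burn_sys_time(20_000)
after = os.times()

print("user CPU consumed: %.3fs" % (after.user - before.user))
print("sys  CPU consumed: %.3fs" % (after.system - before.system))
```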

Why Do We Care About High Sys CPU Time?

There may be the odd time when a high proportion of system CPU time is acceptable but this would be rare (perhaps for a router or other high I/O server). An application server, however, is supposed to do intelligent things and intelligence mostly comes from the user-level CPU time (either as “user” or “nice”). So when the proportion of CPU time is mostly spent in the kernel (system) it indicates something is wrong – perhaps the application is designed badly or inefficiently – or perhaps the system is having I/O troubles.

How To Diagnose The Cause

We want to find out why a particular process is causing abnormally high system CPU time. The top tool will illustrate which process is using most of the CPU – but then you may need to select an individual thread to dig into. To list the threads in a process on Linux use the following command:

ps -AL |grep process_name

The strace tool is invaluable for determining whether a particular system call is being made an excessive number of times. It is useful for debugging as it records the system calls being made by a process. The count option (-c) summarises the calls, giving a view of which one is taking up how much of the system CPU time:


[root@host ~]# strace -c -p 6615
Process 6615 attached - interrupt to quit
Process 6615 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 69.16    0.764598           4    214801           getrusage
 25.20    0.278548         142      1959           poll
  3.67    0.040564          13      3101           write
  0.87    0.009578          16       597        29 futex
  0.84    0.009302           5      1959           read
  0.18    0.001999        1000         2           nanosleep
  0.05    0.000599          50        12           clock_gettime
  0.03    0.000284           6        50           sched_yield
  0.00    0.000000           0         1           close
  0.00    0.000000           0         1           shutdown
------ ----------- ----------- --------- --------- ----------------
100.00    1.105472                222483        29 total

In this example the getrusage() call is being made so frequently (over 100 times more often than any other system call) that it accounts for nearly 70% of this process’s system CPU time. Clearly this is unhealthy, and a look at the man page for this call suggests it is related to statistics/debugging code running in this application.

Another technique is to dump the running stack of all the threads in the misbehaving process:


  for i in `ps -AL |grep process_name |awk '{print $2}'`; do \
    echo === $i ===; \
    gdb -q -nx -ex bt -batch -p $i; \
  done 2>&1 |tee /tmp/stacks.txt

You may need to run that command line several times to see if any particular user function is the primary culprit.

Inline C get_attrs Perl Function For Extracting XML Attributes From Tag

Given an XML tag like:

<field name="tcp.seq" showname="Sequence number: 1    (relative sequence number)" size="4" pos="38" show="1" value="0bb1a171"/>

and the attributes portion of:

name="tcp.seq" showname="Sequence number: 1    (relative sequence number)" size="4" pos="38" show="1" value="0bb1a171"

.. I wanted a function to convert the attributes into a Perl hash:

$hash = +{
  'name' => 'tcp.seq',
  'showname' => 'Sequence number: 1    (relative sequence number)',
  'size' => '4',
  'pos' => '38',
  'show' => '1',
  'value' => '0bb1a171',
}

My first implementation was in Perl:

sub get_attrs {
    my ( $str ) = @_;

    my %data = ();
    while ( $str =~ m{([^ \t=]+)=(?:(\"([^\"]*)\")|([^ />]+))}g ) {
        $data{$1} = ( $2 ? $3 : $4 );
    }

    return( \%data );
}

But in a script of mine this was profiled (using Devel::DProf) as using 64% of the CPU time allocated to the script. So I went about writing a C version:

use Inline C => <<'END_OF_C_CODE';

SV* get_attrs( char *str ) {
    char keyword[1024];
    char valueword[1024];
    char *pstart;
    char *pend;

    HV* hash = newHV();

    pstart = str;
    while ( *pstart ) {
        /* go to next key */
        while ( *pstart && ( *pstart == ' ' ) )
            pstart++;

        /* exit if end of line */
        if ( ! *pstart )
            break;

        /* find end of keyword */
        pend = pstart;
        while ( *pend && ( *pend != '=' ) )
            pend++;

        if ( *pend != '=' ) {
            pstart = pend;
            continue; /* restart search from current point */
        }

        if ( ( pend - pstart ) >= (sizeof(keyword) - 1) )
            break; /* too long */

        memcpy( keyword, pstart, pend - pstart );
        keyword[ pend - pstart ] = '\0';

        /* find start of valueword (skip over equals sign) */
        pstart = pend + 1;

        /* check if value enclosed in double or single quotes */
        if ( ( *pstart == '\"' ) || ( *pstart == '\'' ) ) {
            char quote = *pstart; /* remember what closes value string */

            /* beginning of value starts after quote */
            pstart++;
            pend = pstart;
            while ( *pend && ( *pend != quote ) )
                pend++;

            if ( *pend != quote )
                break;

            if ( ( pend - pstart ) >= (sizeof(valueword) - 1) )
                break; /* too long */

            memcpy( valueword, pstart, pend - pstart );
            valueword[ pend - pstart ] = '\0';

            pstart = pend + 1; /* skip over quote */
        } else {
            pend = pstart;
            while ( *pend && ( *pend != ' ' ) )
                pend++;
            /* note: end of string is a valid terminator for an unquoted value */

            if ( ( pend - pstart ) >= (sizeof(valueword) - 1) )
                break; /* too long */

            memcpy( valueword, pstart, pend - pstart );
            valueword[ pend - pstart ] = '\0';

            pstart = pend;
        }

        /* store and continue */
        SV *svValue = newSVpv( valueword, 0 ); /* newSVpv, not newSVpvf: value may contain '%' */
        hv_store( hash, keyword, strlen(keyword), (SV*)svValue, 0 );
    }

    return newRV_noinc((SV*) hash);
}

END_OF_C_CODE

Benchmarking

Using Test::More and Benchmark I benchmarked the speed difference between the two routines (and verified they returned the same hash when provided the attribute string).

I had the following code to test:

use Test::More;
use Benchmark;

my $testdata = 'name="tcp.seq" showname="Sequence number: 1 (relative sequence number)" size="4" pos="38" show="1" value="0bb1a171"';

Test::More::is_deeply(
    get_attrs1( $testdata ),
    get_attrs2( $testdata ),
    'functions_return_same'
);
Test::More::done_testing();

Benchmark::cmpthese(
    5000000,
    {
        test_inline => "get_attrs1( \'$testdata\' );",
        test_perl => "get_attrs2( \'$testdata\' );",
    }
);

… which output:


user@host:/tmp$ perl -w test.pl
ok 1 - functions_return_same
1..1
                Rate   test_perl test_inline
test_perl    76464/s          --        -79%
test_inline 356888/s        367%          --

How To Remove The 2013 GMail Categories Tab Bar

Today, 17 July 2013, I discovered Google had added a new “categories” bar to my GMail page with the “primary”, “social”, and “promotions” categories.

I didn’t want these and don’t foresee a need for them. How to remove them?

Click on the plus (+) icon to the right of the “promotions” tab.

Click on the plus icon on the right, next to the promotions tab

Then untick all but the “primary” category on the options window that appears, then click “save”.

Untick the "social" and "promotions" categories and then click "save"

This will remove the categories tab.

To Get The Categories Tab Back

To restore the categories tab bar click on the settings icon (the “gear”) and choose “Configure inbox”.

Click on the settings ("the gear") icon and choose "Configure inbox"

This will bring up the window showing the categories that can be selected (which you unticked to turn the bar off).

Avoiding Thundering Herd in Memcached

This article might also be interpreted as “locking shared resources using memcache”.

A common problem with sites that use memcached is the worry that, when memcached is restarted and the cache is empty, a highly-scaled application will hit the cold cache, find it empty, and many threads will then all simultaneously make heavily-computed lookups (typically in a database), bringing the whole site to a halt. This is called the “Thundering Herd” problem.

There are two worries about “Thundering Herd”. The first, and the one considered by this article, is where you have a complex and expensive database query that is used to calculate a cached value; you don’t want 100 users simultaneously trying to access this value soon after you start your application, resulting in that complex database query being executed 100 times when a single execution will do the job. Better to wait for that single query to complete and populate the cache; then all the other users can get the response near instantly afterwards. The second worry is that an empty cache will need filling from many different types of queries – this article does not help with that problem – but then pre-filling a cache could be difficult (if the needed cached values are not determinable ahead of time) and will take time in such a situation nonetheless.

A common recommendation to avoid this is to create “warm-up” scripts that pre-fill memcached before letting the application loose at it.

I think there’s another way: using access controls via memcached itself, because the locks should be cheap and access fast (compared to doing a database lookup). You can build these techniques into your application and forget about Thundering Herd.

This works by locking memcached keys using another memcached key, and putting your application to sleep (polling) until the lock is released if another process holds it. This way, if two clients attempt to access the same key, one can discover it is missing/empty, do the expensive computation once, and then release the lock – the other client will wake up and use the pre-computed key.

If You Can Get The Result From The ADD Operation

We can use the fact that the ADD operation is atomic: either it succeeds because the key didn’t exist before, or it fails because the key exists. We set an expiry time of 10 seconds in case the process crashes before releasing the lock, ensuring that it eventually times out and disappears.

function get_cache_or_calculate( memcache_key, compute_callback, lifetime ) {
  // wait to acquire lock key
  // add key if doesn't exist (10 seconds lifetime)
  while ( ADD( memcache_key + "_lock", 10 ) == false ) {
    sleep( 100ms ); // sleep a little bit then check again
  }

  // we hold the lock
  var value = GET( memcache_key );
  if ( value != null ) {
    // release lock
    DELETE( memcache_key + "_lock" );
    return( value );
  }

  // key did not exist in memcache, compute
  var value = compute_callback();
  PUT( memcache_key, value, lifetime );

  // release lock
  DELETE( memcache_key + "_lock" );
  return( value );
}

function set_cache( memcache_key, value, lifetime ) {
  // wait to acquire lock key
  // add key if doesn't exist (10 seconds lifetime)
  while ( ADD( memcache_key + "_lock", 10 ) == false ) {
    sleep( 100ms ); // sleep a little bit then check again
  }

  // we hold the lock
  PUT( memcache_key, value, lifetime );

  // release lock
  DELETE( memcache_key + "_lock" );
}
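The same ADD-based pattern can be expressed concretely in Python. This is a hypothetical sketch against a minimal in-memory stand-in for a memcached client (a real deployment would use a client library whose add() similarly returns false when the key already exists, and whose expiry actually takes effect):

```python
import time

class FakeMemcache:
    """Minimal in-memory stand-in for a memcached client.

    Only the semantics that matter here: add() is atomic and fails
    if the key already exists.  (No expiry is simulated.)
    """
    def __init__(self):
        self.store = {}

    def add(self, key, value, expire=0):
        if key in self.store:
            return False
        self.store[key] = value
        return True

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value, expire=0):
        self.store[key] = value

    def delete(self, key):
        self.store.pop(key, None)

def get_cache_or_calculate(mc, key, compute, lifetime=300):
    lock_key = key + "_lock"
    # spin until we hold the lock; the 10s expiry (honoured by a real
    # memcached) guards against a holder crashing without releasing
    while not mc.add(lock_key, 1, expire=10):
        time.sleep(0.1)
    try:
        value = mc.get(key)
        if value is None:
            value = compute()                   # expensive work, done once
            mc.set(key, value, expire=lifetime)
        return value
    finally:
        mc.delete(lock_key)                     # always release the lock

mc = FakeMemcache()
calls = []
def expensive():
    calls.append(1)
    return "result"

print(get_cache_or_calculate(mc, "report", expensive))  # computes
print(get_cache_or_calculate(mc, "report", expensive))  # served from cache
print(len(calls))  # → 1
```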

If You Cannot Get The Result From The ADD Operation

function get_cache_or_calculate( memcache_key, compute_callback, lifetime ) {
  // add key if doesn't exist
  ADD( memcache_key + "_lock", 0, 10 ); // initial value 0, 10 seconds lifetime

  // wait to acquire lock key
  while ( true ) {
    var lock_count = INCR( memcache_key + "_lock" );
    if ( lock_count == 1 )
      break; // we have acquired the lock!

    // somebody else has the lock
    DECR( memcache_key + "_lock" );
    sleep( 100ms );
  }

  // we hold the lock
  var value = GET( memcache_key );
  if ( value != null ) {
    DELETE( memcache_key + "_lock" );
    return( value );
  }

  // key did not exist in memcache, compute
  var value = compute_callback();
  PUT( memcache_key, value, lifetime );
  DELETE( memcache_key + "_lock" );
  return( value );
}

function set_cache( memcache_key, value, lifetime ) {
  // add lock key if it doesn't exist (initial value 0, 10 seconds lifetime)
  ADD( memcache_key + "_lock", 0, 10 );

  // wait to acquire lock key
  while ( true ) {
    var lock_count = INCR( memcache_key + "_lock" );
    if ( lock_count == 1 )
      break; // we have acquired the lock!

    // somebody else has the lock
    DECR( memcache_key + "_lock" );
    sleep( 100ms );
  }

  // we hold the lock
  PUT( memcache_key, value, lifetime );
  DELETE( memcache_key + "_lock" );
}

What if the process holding the key dies?

To protect against this a timeout/expiry should be set for every lock object of a reasonable value – say 10 seconds. It would be painful for the particular customer waiting this period but should happen only once.

This involves 3 memcached calls instead of 1 for a read! Isn’t that inefficient?

Yes, it is inefficient. However it is still a lot faster than making a single SELECT call to a database.

Why do you DELETE instead of DECR the lock after you’re finished with it?

Because we want the next acquirer of the lock to set the timeout on it. One of the waiting processes will ADD the key afresh (it doesn’t matter which) just before one of them acquires it (gets a value of 1 upon INCR). Any others will get a result greater than 1 from the INCR operation and go back to sleep before re-testing.

If we merely DECR upon finishing with the lock, the original lock’s timeout would still be ticking – leaving the possibility that another process acquires the old lock and has it expire while in use, resulting in a collision when yet another process creates a new lock.

A sleep time of 100ms? Won’t that cause problems?

Potentially a variation of “live lock” can occur. Say you have 100 threads all wanting to get a common cached value; they all try and ADD a lock only to find it already exists. So they all go to sleep for 100ms. Then they all wake up at the same time and only one gets to acquire a lock. Now you have 99 threads that go to sleep for 100ms. If this continues in this fashion then the last thread to acquire the lock will have waited 10 seconds (100 x 0.1s = 10s).

Such a situation is fairly unlikely as slight variations in execution time will mean that, in all probability, the wake up time will vary by a few milliseconds. So the sleep time could be reduced to, say, 25ms – but this may put additional CPU load (and network traffic to memcached, though slight) on your system.

A remedial action could be to specify a pseudo-random number as the sleep time – greatly increasing the probability that different threads will wake up at different times and reducing the possibility of contention with one another – if this worries you enough.
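As a sketch of that jitter idea (hypothetical Python; the base and spread values are arbitrary choices): rather than a fixed 100ms, each waiter sleeps a randomised interval so wake-ups spread out:

```python
import random

def jittered_backoff_ms(base_ms=50, spread_ms=100):
    """Return a randomised sleep interval in milliseconds."""
    return base_ms + random.uniform(0, spread_ms)

# each contending thread would then do:
#   time.sleep(jittered_backoff_ms() / 1000.0)
print(jittered_backoff_ms())
```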

How To Mount A Crypt Disk In /etc/crypttab

I have a /etc/crypttab like the following:

md0_crypt UUID=17f25217-6080-47f5-b6ce-63537f1fb165 none luks
md1_crypt UUID=c5585594-cc7b-41c2-bd5a-349205617a27 none luks
sdb4_crypt UUID=f656f25c-3fa3-4349-9175-c3f129b64589 none luks,swap
sdc4_crypt UUID=5cafe87d-6745-4cd1-bada-8c912663282b none luks,swap

Now I want to “mount” md1_crypt which didn’t mount at boot time due to RAID problems – and I’ve finally successfully restored the RAID1 array /dev/md1.

The command I used to mount the encrypted disk was:

cryptsetup luksOpen /dev/disk/by-uuid/c5585594-cc7b-41c2-bd5a-349205617a27 md1_crypt

In other words: cryptsetup luksOpen /dev/disk/by-uuid/[UUID] [mapper_name] – where [mapper_name] is the device-mapper name given in /etc/crypttab.

I was asked for my password.

Then I entered the command:

mount /dev/mapper/md1_crypt

…to mount the decrypted disk to the file system from the pre-configured /etc/fstab.

Lost Notification Bar In CyanogenMod 10.1

I was running CyanogenMod version 10.1.0-RC5-p880 on my LG 4X phone.

For no reason, after about a month of running this, I suddenly and inexplicably lost the top notification bar. Well, the bar was still there – it had just gone blank. No battery icon, no signal strength icon, no wireless icon, no icons at all – as this screenshot shows:

The top notification bar which should hold icons was completely blank

I even tried rebooting the phone a few times and this didn’t help; the notification bar remained blank.

How To Fix

I turned the phone off. Then I started the phone in recovery mode. (To do this ensure the phone is powered completely off, then hold down the volume-down key – and while holding it press and hold the power button until the LG logo appears – then release the power button – and when the recovery mode screen appears release the volume-down key.)

Next I chose “Wipe Cache” – followed by “Advanced”, “Wipe Dalvik Cache”.

Then rebooted again – and the notification bar returned!