newspaint

Documenting Problems That Were Difficult To Find The Answer To

Inline C get_attrs Perl Function For Extracting XML Attributes From Tag

Given an XML tag like:

<field name="tcp.seq" showname="Sequence number: 1    (relative sequence number)" size="4" pos="38" show="1" value="0bb1a171"/>

and the attributes portion of:

name="tcp.seq" showname="Sequence number: 1    (relative sequence number)" size="4" pos="38" show="1" value="0bb1a171"

.. I wanted a function to convert the attributes into a Perl hash:

$hash = +{
  'name' => 'tcp.seq',
  'showname' => 'Sequence number: 1    (relative sequence number)',
  'size' => '4',
  'pos' => '38',
  'show' => '1',
  'value' => '0bb1a171',
}

My first implementation was in Perl:

sub get_attrs {
    my ( $str ) = @_;

    my %data = ();
    while ( $str =~ m{([^ \t=]+)=(?:(\"([^\"]*)\")|([^ />]+))}g ) {
        $data{$1} = ( $2 ? $3 : $4 );
    }

    return( \%data );
}

But in a script of mine this was profiled (using Devel::DProf) as using 64% of the CPU time allocated to the script. So I went about writing a C version:

use Inline C => <<'END_OF_C_CODE';

SV* get_attrs( char *str ) {
    char keyword[1024];
    char valueword[1024];
    char *pstart;
    char *pend;

    HV* hash = newHV();

    pstart = str;
    while ( *pstart ) {
        /* go to next key */
        while ( *pstart && ( *pstart == ' ' ) )
            pstart++;

        /* exit if end of line */
        if ( ! *pstart )
            break;

        /* find end of keyword */
        pend = pstart;
        while ( *pend && ( *pend != '=' ) )
            pend++;

        if ( *pend != '=' ) {
            pstart = pend;
            continue; /* restart search from current point */
        }

        if ( ( pend - pstart ) >= (sizeof(keyword) - 1) )
            break; /* too long */

        memcpy( keyword, pstart, pend - pstart );
        keyword[ pend - pstart ] = '\0';

        /* find start of valueword (skip over equals sign) */
        pstart = pend + 1;

        /* check if value enclosed in double or single quotes */
        if ( ( *pstart == '\"' ) || ( *pstart == '\'' ) ) {
            char quote = *pstart; /* remember what closes value string */

            /* beginning over value starts after quote */
            pstart++;
            pend = pstart;
            while ( *pend && ( *pend != quote ) )
                pend++;

            if ( *pend != quote )
                break;

            if ( ( pend - pstart ) >= (sizeof(valueword) - 1) )
                break; /* too long */

            memcpy( valueword, pstart, pend - pstart );
            valueword[ pend - pstart ] = '\0';

            pstart = pend + 1; /* skip over quote */
        } else {
            pend = pstart;
            while ( *pend && ( *pend != ' ' ) )
                pend++;

            if ( ! *pend )
                break;

            if ( ( pend - pstart ) >= (sizeof(valueword) - 1) )
                break; /* too long */

            memcpy( valueword, pstart, pend - pstart );
            valueword[ pend - pstart ] = '\0';

            pstart = pend;
        }

        /* store and continue */
        SV *svValue = newSVpvf( valueword );
        hv_store( hash, keyword, strlen(keyword), (SV*)svValue, 0 );
    }

    return newRV_noinc((SV*) hash);
}

END_OF_C_CODE

Benchmarking

Using Test::More and Benchmark I benchmarked the speed difference between the two routines (and verified they returned the same hash when provided the attribute string).

I had the following code to test:

use Test::More;
use Benchmark;

my $testdata = 'name="tcp.seq" showname="Sequence number: 1 (relative sequence number)" size="4" pos="38" show="1" value="0bb1a171"';

Test::More::is_deeply(
    get_attrs1( $testdata ),
    get_attrs2( $testdata ),
    'functions_return_same'
);
Test::More::done_testing();

Benchmark::cmpthese(
    5000000,
    {
        test_inline => "get_attrs1( \'$testdata\' );",
        test_perl => "get_attrs2( \'$testdata\' );",
    }
);

… which output:


user@host:/tmp$ perl -w test.pl
ok 1 - functions_return_same
1..1
                Rate   test_perl test_inline
test_perl    76464/s          --        -79%
test_inline 356888/s        367%          --

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: