newspaint

Documenting Problems That Were Difficult To Find The Answer To

Category Archives: PhantomJS

PhantomJS window.setTimeout Appears to be Ignored

This applies equally to Node.js programs and PhantomJS programs.

You’ve got a JavaScript program that has many setTimeout() or window.setTimeout() calls. Yet your program appears to be failing for no apparent reason. Even after scrutinising your program – perhaps running it through esvalidate and concluding that there are no syntax errors – your script just appears to terminate without any reason. And those terminations happen before an expected setTimeout() callback is due to trigger.

You wouldn’t be alone if you suffered this problem. Quite a few people have contributed to a GitHub PhantomJS issue on this topic, many of them assuming their call to setTimeout() had failed.

Here is an example script that appears to terminate without reason before the final window.setTimeout() call:

function step3() {
    console.log( "+step3()" );
    window.setTimeout(
        function () {
            console.log( "-exiting phantomJS" );
            phantom.exit(1);
        }, 1000
    );
}

function step2() {
    console.log( "+step2()" );
    window.setTimeout(
        function () {
            step3();
            phantom.exit(1);
        }, 1000
    );
}

function step1() {
    console.log( "+step1()" );
    window.setTimeout(
        function () {
            step2();
        }, 1000
    );
}

step1();

When run this script outputs:

user@host:~$ phantomjs /tmp/test.js
+step1()
+step2()
+step3()

This script appeared to terminate without displaying the final “-exiting phantomJS” log message. So what went wrong? After all, step3() was executed as expected.

The problem is that an unwanted call to phantom.exit() was made after the call to step3() in step2(). What actually happens is that step3() is called, and the setTimeout() is made, but then the function returns back and executes the call to phantom.exit() – which terminates the script before the final setTimeout() callback has a chance to trigger.
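The ordering can be seen without PhantomJS. The following is a minimal sketch that runs under Node.js, so it uses the global setTimeout() rather than window.setTimeout(), and the log array is a hypothetical stand-in for console output. It shows that the statement following step3() runs before the timer callback does:

```javascript
// collects messages in the order they happen
var log = [];

function step3() {
    log.push( "+step3()" );
    // setTimeout() only schedules the callback, then returns immediately
    setTimeout(
        function () {
            log.push( "timer fired" );
        }, 10
    );
}

step3();
// This statement runs before the timer callback. In the failing script
// this is the position where the unwanted phantom.exit() call sat.
log.push( "statement after step3()" );
```

At the end of this listing the log holds only the first two messages; "timer fired" is appended later, once control has returned to the event loop.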

How Can I Avoid This Silent Killer?

You can protect yourself from having this problem in the future. Never, never call phantom.exit() directly from your script. Instead add the following function and call this instead whenever you want to terminate your script:

function quit( reason, value ) {
    console.log( "QUIT: " + reason );
    phantom.exit( value );
}

Why do this? If your program aborts you want to know WHY. If you provide a unique message at every point in which your program can exit then you can quickly identify when a mistaken quit() call has interrupted your expected setTimeout() callback.

Here’s the fixed program:

function quit( reason, value ) {
    console.log( "QUIT: " + reason );
    phantom.exit( value );
}

function step3() {
    console.log( "+step3()" );
    window.setTimeout(
        function () {
            console.log( "-exiting phantomJS" );
            quit( "finished", 0 );
        }, 1000
    );
}

function step2() {
    console.log( "+step2()" );
    window.setTimeout(
        function () {
            step3();
            // phantom.exit(1); WAS CAUSING PROBLEM
        }, 1000
    );
}

function step1() {
    console.log( "+step1()" );
    window.setTimeout(
        function () {
            step2();
        }, 1000
    );
}

step1();

This outputs:

user@host:~$ phantomjs /tmp/test.js
+step1()
+step2()
+step3()
-exiting phantomJS
QUIT: finished

JavaScript Function to Call Functions In Sequence

This post is targeted towards PhantomJS where you may want to simulate a user entering data into several fields using the keyboard but pausing between each field entry.

The following function takes a list of functions and executes them with a fixed delay before each one. Callers supply two parameters (the third, index, is used internally by the recursion and should be omitted):

  • events – array of functions to call sequentially
  • delay – delay, in milliseconds, to pause before calling each event

function space_out_events( events, delay, index ) {
    index = ( typeof( index ) !== 'undefined' ) ? index : 0;

    window.setTimeout(
        function () {
            events[index]();
            if ( index < ( events.length - 1 ) ) {
                space_out_events( events, delay, index + 1 );
            }
        }, delay
    );
}

Example Usage

You have a form on a page to fill out. You want to simulate entering all the information by keyboard: type a value, pause for half a second (500ms), press the tab key, pause another 500ms, then fill out the next field.

    space_out_events(
        [
            function(){ page.sendEvent( 'keypress', config.email ); },
            function(){ page.sendEvent( 'keypress', page.event.key.Tab ); },
            function(){ page.sendEvent( 'keypress', "Testing123" ); },
            function(){ page.sendEvent( 'keypress', page.event.key.Tab ); },
            function(){ page.sendEvent( 'keypress', "Testing123" ); },
            function(){ page.sendEvent( 'keypress', page.event.key.Tab ); },
            // ignore title
            function(){ page.sendEvent( 'keypress', page.event.key.Tab ); },
            function(){ page.sendEvent( 'keypress', "John" ); },
            function(){ page.sendEvent( 'keypress', page.event.key.Tab ); },
            function(){ page.sendEvent( 'keypress', "Smith" );},
            function(){ page.sendEvent( 'keypress', page.event.key.Tab ); },
            // ignore mobile number
            function(){ page.sendEvent( 'keypress', page.event.key.Tab ); },
            // ignore day of birth
            function(){ page.sendEvent( 'keypress', page.event.key.Tab ); },
            // ignore month of birth
            function(){ page.sendEvent( 'keypress', page.event.key.Tab ); },
            // ignore year of birth
            function(){ page.sendEvent( 'keypress', page.event.key.Tab ); },
            // ignore screen name
            function(){ page.sendEvent( 'keypress', page.event.key.Tab ); },
            function(){ if ( config.debug ) { page.render( "filled_form.png" ); } },
            function(){ do_click_next_step( page ); }
        ], 500
    );

The technique could also be used with Node.JS.
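As a rough sketch of what a Node.js version might look like: the only required change is using the global setTimeout() in place of window.setTimeout(). The function name space_out_events_node, the optional schedule parameter (which lets a different timer function be supplied) and the guard for an empty list are my own additions for illustration, not part of the original function.

```javascript
// Variant of space_out_events() for Node.js.
// "schedule" is a hypothetical extra parameter; it defaults to setTimeout.
function space_out_events_node( events, delay, schedule, index ) {
    schedule = schedule || setTimeout;
    index = ( typeof( index ) !== 'undefined' ) ? index : 0;

    // stop when the list is exhausted (also covers an empty list)
    if ( index >= events.length ) {
        return;
    }

    schedule(
        function () {
            events[index]();
            space_out_events_node( events, delay, schedule, index + 1 );
        }, delay
    );
}

// example: two messages, each printed after a 500ms pause
space_out_events_node(
    [
        function(){ console.log( "first" ); },
        function(){ console.log( "second" ); }
    ], 500
);
```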

Sampling Pixels on Web Page Using PhantomJS

Concerned your website might be hacked and defaced? How do you protect against this? One way might be to get a “snapshot” of your webpage as an image and then test pixels to see if they match the colours you expect at particular static locations. Not perfect but you’ll know quickly if someone wasn’t subtle when replacing the front-page content.

The technique I propose is this:

  • load the web page
  • render to a base-64 string
  • load “about:blank”
  • dynamically add an image to the DOM
  • assign the base-64 string to the image
  • dynamically add a canvas to the DOM
  • draw the image to the canvas’ context
  • sample points from the canvas’ context

First things first: load the web page and render it to a base-64 string:

var system = require('system');
var page = require('webpage').create();
page.onResourceError = function(resourceError) {
    page.reason = resourceError.errorString;
    page.reason_url = resourceError.url;
};

page.viewportSize = {
    'width': 1000,
    'height': 768
};

page.onConsoleMessage = function(msg) {
    system.stderr.writeLine('console: ' + msg);
};

page.open(
    'http://www.google.com/',
    function (status) {
        if ( status !== 'success' ) {
            console.log(
                "Error opening url \"" + page.reason_url
                + "\": " + page.reason
            );
            phantom.exit( 1 );
            return;
        }

        // take snapshot after 2 seconds to allow page AJAX to run
        window.setTimeout(
            function () {
                var imgBase64 = page.renderBase64();
                check_image( imgBase64 ); // function defined below
            },
            2 * 1000 // give page 2 seconds to execute before render
        );
    }
);

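Now we have our base-64 string. Before checking it I recommend replacing the loaded page with a blank page (so that the tested page’s JavaScript can’t wreak havoc on our image test function). That means calling check_image() from the callback of a second page.open(), in place of the direct call inside the timeout callback above: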

page.open(
    'about:blank',
    function ( status ) {
        check_image( imgBase64 ); // function defined below
    }
);

The pixel checking code must run inside a page.evaluate() call because it relies on the page’s DOM (an image element and a canvas):

function check_image_js( imgString ) {
    // returns hex colour in ttrrggbb format (tt is opacity)
    function get_color( context, x, y ) {
        // return 2-character hex string (0-255)
        function decToHex( dd ) {
            var hex = Number(dd).toString(16);
            hex = "00".substr( 0, 2 - hex.length ) + hex;
            return hex;
        }

        var imgd = context.getImageData( x, y, 1, 1 ).data;
        var colorstr = (
            decToHex( imgd[3] ) + decToHex( imgd[0] ) +
            decToHex( imgd[1] ) + decToHex( imgd[2] )
        );

        return colorstr;
    }

    // list of pixels to check
    var pixels = [
        { x: 272, y: 135, col: 'ff000000', desc: 'Black in title' },
        { x: 97, y: 9, col: 'ffffffff', desc: 'White in button' }
    ];

    // create canvas
    var canvas = document.createElement( 'canvas' );
    canvas.width = 1000;
    canvas.height = 768;
    var context = canvas.getContext( '2d' );
    
    // load image
    var img = new Image();
    img.onload = function () { context.drawImage( img, 0, 0 ); };
    img.src = "data:image/png;base64," + imgString;

    // give time for image to load
    window.setTimeout(
        function () {
            var problemDetected = false;

            console.log( "- doing pixels" );

            var idx;
            for ( idx = 0; idx < pixels.length; idx++ ) {
                var row = pixels[idx];

                var actual = get_color( context, row.x, row.y );
                if ( actual !== row.col ) {
                    problemDetected = true;
                    console.log( "ERROR mismatch: expected \"" + row.col + "\", got \"" + actual + "\" (ttrrggbb) for \"" + row.desc + "\" at (" + row.x + ", " + row.y + ")" );
                } else {
                    console.log( "  - matched colour \"" + actual + "\" (ttrrggbb) for \"" + row.desc + "\" at (" + row.x + ", " + row.y + ")" );
                }
            }
        }, 500
    );
}

// call check_image_js in page.evaluate sandbox
function check_image( imgString ) {
    page.evaluate(
        function ( callback, string ) {
            callback( string );
        },
        check_image_js,
        imgString
    );
}
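The colour comparison itself does not need a browser, so it can be tried out separately. Below is a minimal sketch that runs under Node.js. The names rgbaToHex, find_mismatches and all_white are hypothetical helpers introduced for this illustration; all_white stands in for the canvas by reporting every pixel as opaque white, in the [r, g, b, a] order that getImageData().data uses.

```javascript
// return 2-character hex string for a value in the range 0-255
function decToHex( dd ) {
    var hex = Number( dd ).toString( 16 );
    return ( hex.length < 2 ? "0" : "" ) + hex;
}

// convert [r, g, b, a] to a ttrrggbb string (tt is opacity)
function rgbaToHex( data ) {
    return decToHex( data[3] ) + decToHex( data[0] ) +
        decToHex( data[1] ) + decToHex( data[2] );
}

// return one entry for each pixel whose sampled colour differs from "col";
// "sample" is a function returning [r, g, b, a] for a given (x, y)
function find_mismatches( pixels, sample ) {
    var mismatches = [];
    var idx;
    for ( idx = 0; idx < pixels.length; idx++ ) {
        var row = pixels[idx];
        var actual = rgbaToHex( sample( row.x, row.y ) );
        if ( actual !== row.col ) {
            mismatches.push(
                { desc: row.desc, expected: row.col, actual: actual }
            );
        }
    }
    return mismatches;
}

// stand-in for a canvas: every pixel is opaque white
function all_white( x, y ) {
    return [ 255, 255, 255, 255 ];
}

var problems = find_mismatches(
    [
        { x: 272, y: 135, col: 'ff000000', desc: 'Black in title' },
        { x: 97, y: 9, col: 'ffffffff', desc: 'White in button' }
    ],
    all_white
);
console.log( JSON.stringify( problems ) );
```

Returning the list of mismatches, rather than only logging them, makes it easier to decide afterwards whether to raise an alert.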

Getting To The Bottom Of Why A PhantomJS Page Load Fails

For this post I’m using PhantomJS version 1.9.

Quite frustratingly I occasionally have a call to page.open() where my callback receives a status of “fail”. This isn’t very helpful as it doesn’t describe what went wrong. Was it an SSL handshake problem (using the --ignore-ssl-errors=true command line argument may solve such problems)? Something else?

Unfortunately the PhantomJS API, at present, doesn’t appear to have an ability to determine the reason for the failure of the page to load. But there are a number of callbacks we can hook into to generate a lot of debugging messages to allow us to determine the reason for the failure.

Simplified Reason Tracking

Just before calling page.open() add the following code (after creating the page variable):

    page.onResourceError = function(resourceError) {
        page.reason = resourceError.errorString;
        page.reason_url = resourceError.url;
    };

Now you can print out the reason for a problem in your page.open() callback, e.g.:

var page = require('webpage').create();

page.onResourceError = function(resourceError) {
    page.reason = resourceError.errorString;
    page.reason_url = resourceError.url;
};

page.open(
    "http://www.nosuchdomain/",
    function (status) {
        if ( status !== 'success' ) {
            console.log(
                "Error opening url \"" + page.reason_url
                + "\": " + page.reason
            );
            phantom.exit( 1 );
        } else {
            console.log( "Successful page open!" );
            phantom.exit( 0 );
        }
    }
);

This script outputs the following:

Error opening url "http://www.nosuchdomain/": Host www.nosuchdomain not found
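Note that the handler above remembers only the most recent failure, so when several resources fail the earlier reasons are overwritten. A possible variant, sketched below, keeps every failure in a list. The property name errors and the helper names track_resource_errors and describe_errors are my own inventions, not part of the PhantomJS API. In a real script page would come from require('webpage').create(); here a plain object stands in so the sketch can run under Node.js.

```javascript
// attach a resource error handler that remembers every failure
// ("errors" is a hypothetical property name)
function track_resource_errors( page ) {
    page.errors = [];
    page.onResourceError = function ( resourceError ) {
        page.errors.push( {
            url: resourceError.url,
            reason: resourceError.errorString
        } );
    };
}

// build one line of text per recorded failure
function describe_errors( page ) {
    return page.errors.map(
        function ( e ) {
            return "Error opening url \"" + e.url + "\": " + e.reason;
        }
    );
}

// stand-in for a PhantomJS page object
var fakePage = {};
track_resource_errors( fakePage );

// simulate PhantomJS reporting a failed resource
fakePage.onResourceError( {
    url: "http://www.nosuchdomain/",
    errorString: "Host www.nosuchdomain not found"
} );

console.log( describe_errors( fakePage ).join( "\n" ) );
```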

Detailed Logging

Just before calling page.open() add the following code (after creating the page variable):

    page.onResourceRequested = function (request) {
        system.stderr.writeLine('= onResourceRequested()');
        system.stderr.writeLine('  request: ' + JSON.stringify(request, undefined, 4));
    };

    page.onResourceReceived = function(response) {
        system.stderr.writeLine('= onResourceReceived()' );
        system.stderr.writeLine('  id: ' + response.id + ', stage: "' + response.stage + '", response: ' + JSON.stringify(response));
    };

    page.onLoadStarted = function() {
        system.stderr.writeLine('= onLoadStarted()');
        var currentUrl = page.evaluate(function() {
            return window.location.href;
        });
        system.stderr.writeLine('  leaving url: ' + currentUrl);
    };

    page.onLoadFinished = function(status) {
        system.stderr.writeLine('= onLoadFinished()');
        system.stderr.writeLine('  status: ' + status);
    };

    page.onNavigationRequested = function(url, type, willNavigate, main) {
        system.stderr.writeLine('= onNavigationRequested');
        system.stderr.writeLine('  destination_url: ' + url);
        system.stderr.writeLine('  type (cause): ' + type);
        system.stderr.writeLine('  will navigate: ' + willNavigate);
        system.stderr.writeLine('  from page\'s main frame: ' + main);
    };

    page.onResourceError = function(resourceError) {
        system.stderr.writeLine('= onResourceError()');
        system.stderr.writeLine('  - unable to load url: "' + resourceError.url + '"');
        system.stderr.writeLine('  - error code: ' + resourceError.errorCode + ', description: ' + resourceError.errorString );
    };

    page.onError = function(msg, trace) {
        system.stderr.writeLine('= onError()');
        var msgStack = ['  ERROR: ' + msg];
        if (trace) {
            msgStack.push('  TRACE:');
            trace.forEach(function(t) {
                msgStack.push('    -> ' + t.file + ': ' + t.line + (t.function ? ' (in function "' + t.function + '")' : ''));
            });
        }
        system.stderr.writeLine(msgStack.join('\n'));
    };

It is important that this block runs after the page and system variables have been defined, e.g.:

var system = require('system');
var page = require('webpage').create();

PhantomJS Exit Doesn’t Exit Where You Expect It To Exit

Here’s a little trick using PhantomJS 1.9:

var system = require('system');

phantom.exit();
system.stdout.writeLine( "This is printed after exit!" );

Here’s what it does – it prints out:

This is printed after exit!

This doesn’t happen if you use the console.log() function but it does with the system.stdout and system.stderr objects.

It appears that PhantomJS finishes running the current function before actually terminating the JavaScript interpreter. So one should always take preventative actions to ensure no more code can run after a call to phantom.exit().
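One way to take such preventative action is sketched below: wrap the exit call so that it sets a flag, and have later code check the flag before doing anything. The names make_quit, is_exiting and do_work are hypothetical. The exit function is passed in so that the sketch can run under Node.js; in a PhantomJS script you would pass function ( value ) { phantom.exit( value ); } instead.

```javascript
// build a quit() function around whatever actually ends the program
function make_quit( exit_fn ) {
    var exiting = false;
    return {
        quit: function ( reason, value ) {
            if ( exiting ) {
                return; // already exiting, ignore repeated calls
            }
            exiting = true;
            console.log( "QUIT: " + reason );
            exit_fn( value );
        },
        is_exiting: function () {
            return exiting;
        }
    };
}

// stand-in for phantom.exit(): just records the exit value
var exited_with = [];
var app = make_quit( function ( value ) { exited_with.push( value ); } );

// any code that might run after quit() checks the flag first
function do_work() {
    if ( app.is_exiting() ) {
        return "skipped";
    }
    return "ran";
}

var first = do_work();          // runs normally
app.quit( "finished", 0 );
app.quit( "called twice", 1 );  // ignored
var second = do_work();         // skipped because quit() was called
```

Inside a single function the simplest guard remains a return statement immediately after the call to quit().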

Waiting For Page To Load In PhantomJS

Here is a function I’ve created which waits for an element in the DOM to appear in PhantomJS.

Some modern JavaScript-dependent pages will accept your form submission then dynamically load the desired response – but this can take some time.

Syntax

The following function takes four parameters:

  • page – reference to the PhantomJS webpage object
  • selector – a string to pass to document.querySelector() to wait for
  • expiry – milliseconds past epoch at which waiting should cease
  • callback – the function to call on expiry or selector element found

Example

For example:

    // click button
    page.evaluate(
        function () {
            document.querySelector("button[name=do]").click();
            document.querySelector("form[name=theform]").submit();
        }
    );

    waitFor(
        page,
        "span.from", // wait for this object to appear
        (new Date()).getTime() + 5000, // timeout at 5 seconds from now
        function (status) {
            system.stderr.writeLine( "- submission status: " + status );

            if ( status ) {
                // success, element found by waitFor()
                page.render( "/tmp/results.png" );
                process_rows( page );
            } else {
                // waitFor() timed out
                phantom.exit( 1 );
            }
        }
    );

Implementation

The waitFor() function is defined as:

function waitFor( page, selector, expiry, callback ) {
    system.stderr.writeLine( "- waitFor( " + selector + ", " + expiry + " )" );

    // try and fetch the desired element from the page
    var result = page.evaluate(
        function (selector) {
            return document.querySelector( selector );
        }, selector
    );

    // if desired element found then call callback after 50ms
    if ( result ) {
        system.stderr.writeLine( "- trigger " + selector + " found" );
        window.setTimeout(
            function () {
                callback( true );
            },
            50
        );
        return;
    }

    // determine whether timeout is triggered
    var finish = (new Date()).getTime();
    if ( finish > expiry ) {
        system.stderr.writeLine( "- timed out" );
        callback( false );
        return;
    }

    // haven't timed out, haven't found object, so poll in another 100ms
    window.setTimeout(
        function () {
            waitFor( page, selector, expiry, callback );
        },
        100
    );
}

Some notes should be made. This function actually polls every 100 milliseconds. When it detects the desired object in the DOM it waits a further short period of 50 milliseconds as a precautionary measure in case the page was in the middle of generating at the time the element was detected.
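The polling logic can be separated from PhantomJS, which makes it easier to test. The following is only a sketch under my own assumptions: poll_until and its timers parameter are hypothetical, the check function replaces the page.evaluate() lookup, and the 50 millisecond settling delay is left out for brevity. It runs under Node.js.

```javascript
// call check() every 100ms until it returns true or expiry has passed;
// "timers" optionally supplies replacement schedule() and now() functions
function poll_until( check, expiry, callback, timers ) {
    timers = timers || {};
    var schedule = timers.schedule || setTimeout;
    var now = timers.now || function () { return ( new Date() ).getTime(); };

    // condition met: report success
    if ( check() ) {
        callback( true );
        return;
    }

    // determine whether timeout is triggered
    if ( now() > expiry ) {
        callback( false );
        return;
    }

    // neither found nor timed out, so poll again in another 100ms
    schedule(
        function () {
            poll_until( check, expiry, callback, timers );
        },
        100
    );
}

// example: wait (up to one second) for a flag set by another timer
var ready = false;
setTimeout( function () { ready = true; }, 250 );

poll_until(
    function () { return ready; },
    ( new Date() ).getTime() + 1000,
    function ( status ) {
        console.log( "- wait finished, status: " + status );
    }
);
```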

How Do I Get The Current Page URL In PhantomJS?

If you simply do the following in PhantomJS:

console.log( "- current url is " + document.URL );

then you will see the filename of the JavaScript file you are running with PhantomJS.

If you want to see the URL of the currently loaded page, however, then you have to do it within the loaded page’s sandbox:

var url = page.evaluate(
    function () {
        return document.URL;
    }
);

console.log( "- current url is " + url );

Adblock for PhantomJS

Starting from version 1.9 of PhantomJS there exists the ability to abort a request for a URL.

The code below is an example of how to do this (blocking by site name):

// extract domain name from a URL
function sitename( url ) {
    var result = /^https?:\/\/([^\/]+)/.exec( url );
    if ( result ) {
        return( result[1] );
    } else {
        return( null );
    }
}

// add a callback to every request performed on a webpage
function adblock( page ) {
    page.onResourceRequested = function ( requestData, networkRequest ) {
        // pull out site name from URL
        var site = sitename( requestData.url );
        if ( ! site )
            return;

        // abort requests for particular domains
        if (
            ( /\.doubleclick\./.test( site ) ) ||
            ( /\.pubmatic\.com$/.test( site ) )
        ) {
            console.error( "  - BLOCKED URL from " + site );
            networkRequest.abort();
            return;
        }
    };
}

var page = require('webpage').create();
adblock( page );
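The matching decision can be checked on its own, outside of PhantomJS. In this sketch the blocked_sites list and the is_blocked() helper are hypothetical names introduced for illustration; sitename() is the same logic as above written more compactly. It runs under Node.js.

```javascript
// extract domain name from a URL
function sitename( url ) {
    var result = /^https?:\/\/([^\/]+)/.exec( url );
    return result ? result[1] : null;
}

// patterns for domains whose requests should be aborted
var blocked_sites = [
    /\.doubleclick\./,
    /\.pubmatic\.com$/
];

// true if the URL's site name matches any blocked pattern
function is_blocked( url ) {
    var site = sitename( url );
    if ( ! site ) {
        return false; // not an http(s) URL, e.g. a data: URI
    }
    return blocked_sites.some(
        function ( re ) {
            return re.test( site );
        }
    );
}

console.log( is_blocked( "http://ad.doubleclick.net/adj/x" ) );
console.log( is_blocked( "http://www.example.com/" ) );
```

Keeping the patterns in an array means a new domain can be blocked by adding one line instead of extending the if condition.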

If, for example, you wanted to prevent images from being loaded, you could define adblock() as:

function adblock( page ) {
    // regex literal: inside a quoted string the backslashes would be lost
    var regexpImg = /\.(jpe?g|png|gif|svg)(\?.*)?$/i;

    page.onResourceRequested = function ( requestData, networkRequest ) {
        if ( regexpImg.test( requestData.url ) ) {
            console.error( "  - BLOCKED URL: " + requestData.url );
            networkRequest.abort();
            return;
        }
    };
}
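An image pattern like this is easy to get subtly wrong, so it is worth trying against a few URLs before relying on it. The sketch below runs under Node.js and writes the pattern as a regular expression literal, because inside a quoted string '\.' is simply '.' and '\?' is simply '?'. The helper name is_image_url and the example URLs are made up for illustration.

```javascript
// matches URLs ending in a common image extension, with optional query string
var regexpImg = /\.(jpe?g|png|gif|svg)(\?.*)?$/i;

function is_image_url( url ) {
    return regexpImg.test( url );
}

var samples = [
    "http://example.com/logo.PNG",
    "http://example.com/photo.jpeg?size=2",
    "http://example.com/index.html",
    "http://example.com/png"
];

samples.forEach(
    function ( url ) {
        console.log( ( is_image_url( url ) ? "block " : "allow " ) + url );
    }
);
```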

How to Click on a Div or Span Using PhantomJS

What is PhantomJS?

PhantomJS is a headless browser built on WebKit (its JavaScript engine is JavaScriptCore, not V8) that is scripted through a JavaScript API, somewhat like Node.JS. This makes it extremely useful for testing.

One of the best features is the ability to output the virtual display at any time to a PNG graphic file.

The Problem

I was struggling with how to click on a div or span element using PhantomJS.

Other elements could be clicked on by code not unlike the following (from imagebin.js included in the examples provided with the PhantomJS distribution):

page.open("http://imagebin.org/index.php?page=add", function () {
    page.evaluate( function () {
        document.querySelector('input[name=nickname]').value = 'phantom';
        document.querySelector('input[name=disclaimer_agree]').click();
        document.querySelector('form').submit();
    });
});

The problem is that the element returned by the document.querySelector() function for a div or span does not have a click() method.

The Solution

The solution was found in this blog post and the following example should make this clear:

page.evaluate( function() {
    // find element to send click to
    var element = document.querySelector( 'span.control.critical.closer' );

    // create a mouse click event
    var event = document.createEvent( 'MouseEvents' );
    event.initMouseEvent( 'click', true, true, window, 1, 0, 0 );

    // send click to element
    element.dispatchEvent( event );
});

You might ask where I got “span.control.critical.closer” from. Well I use Google Chrome and manually load up the web page with the element I want to click. Then I click on the menu and select tools -> developer tools (alternatively press shift+ctrl+I). Then I click on the magnifying glass icon on the bottom of the screen and then click on the div or span element I want to click – and copy down the name.

Example Usage

I would use code like the following:

function mouseclick( element ) {
    // create a mouse click event
    var event = document.createEvent( 'MouseEvents' );
    event.initMouseEvent( 'click', true, true, window, 1, 0, 0 );

    // send click to element
    element.dispatchEvent( event );
}

function handle_page( page ) {
    page.evaluate(
        function( mouseclick_fn ) {
            var element = document.querySelector( "input#payConf" );
            mouseclick_fn( element );
        },
        mouseclick
    );

    window.setTimeout(
        function () {
            handle_click_reaction( page );
        },
        5000 // give page 5 seconds to process click
    );
}
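A different approach, which I offer only as a sketch, is to ask the page for the element’s position and then send a native click to those coordinates with page.sendEvent( 'click', x, y ). This assumes your PhantomJS version supports page.sendEvent() for mouse events and that getBoundingClientRect() is available in the page. The helper names rect_centre and click_selector are hypothetical. The listing only defines functions, so it loads under Node.js as well as PhantomJS.

```javascript
// centre point of a rectangle given as { left, top, width, height }
function rect_centre( rect ) {
    return {
        x: Math.round( rect.left + rect.width / 2 ),
        y: Math.round( rect.top + rect.height / 2 )
    };
}

// click the centre of the first element matching selector;
// returns false if no such element exists
function click_selector( page, selector ) {
    var rect = page.evaluate(
        function ( sel ) {
            var element = document.querySelector( sel );
            if ( ! element ) {
                return null;
            }
            // copy to a plain object so it can be returned from the sandbox
            var r = element.getBoundingClientRect();
            return {
                left: r.left, top: r.top,
                width: r.width, height: r.height
            };
        },
        selector
    );

    if ( ! rect ) {
        return false;
    }

    var centre = rect_centre( rect );
    page.sendEvent( 'click', centre.x, centre.y );
    return true;
}
```

As with the dispatched event, the page still needs time to react after the click, so follow a call to click_selector() with window.setTimeout() or a waitFor() style function.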

Important!

Many users new to PhantomJS seem oblivious to the need to wait some time to allow the virtual browser to do whatever action the click intended to do. After you dispatch your mouse event you will either want to call window.setTimeout or implement a waitFor() type function (you can find such a function elsewhere on this blog).

If you expect to get results from your click instantly you will most likely be disappointed!